arXiv:1503.01245v1 [math.ST] 4 Mar 2015




Large Dimensional Analysis of Robust 
M-Estimators of Covariance with Outliers 

David Morales-Jimenez*, Romain Couillet†, Matthew R. McKay*


Abstract —A large dimensional characterization of robust 
M-estimators of covariance (or scatter) is provided under the 
assumption that the dataset comprises independent (essentially 
Gaussian) legitimate samples as well as arbitrary deterministic 
samples, referred to as outliers. Building upon recent random 
matrix advances in the area of robust statistics, we specifically 
show that the so-called Maronna M-estimator of scatter asymptotically
behaves similarly to well-known random matrices when
the population and sample sizes grow together to infinity. The
introduction of outliers leads the robust estimator to behave 
asymptotically as the weighted sum of the sample outer products, 
with a constant weight for all legitimate samples and different 
weights for the outliers. A fine analysis of this structure reveals 
importantly that the propensity of the M-estimator to attenuate 
(or enhance) the impact of outliers is mostly dictated by the 
alignment of the outliers with the inverse population covariance 
matrix of the legitimate samples. Thus, robust M-estimators can 
bring substantial benefits over more simplistic estimators such 
as the per-sample normalized version of the sample covariance 
matrix, which is not capable of differentiating the outlying 
samples. The analysis shows that, within the class of Maronna’s 
estimators of scatter, the Huber estimator is most favorable for 
rejecting outliers. On the contrary, estimators more similar to 
Tyler’s scale invariant estimator (often preferred in the literature) 
run the risk of inadvertently enhancing some outliers. 

Index Terms —Robust statistics, M-estimation, outliers. 

I. Introduction 

The growing momentum of big data applications along 
with the recent advances in large dimensional random matrix 
theory have raised much interest for problems in statistics 
and signal processing under the assumption of large but 
similar population dimension N and sample size n. Due to 
the intrinsic complexity of large dimensional random matrix 
theory, as compared to classical statistics where N is fixed 
and $n \to \infty$, most of the classical applications were concerned
with sample covariance matrix (SCM) based methods
(as in e.g., [1, 2] for source detection or y| for subspace
estimation). Only recently have other random matrix structures
started to be explored which are adequate to deal with more
advanced statistical problems; see for instance |@1 on Toeplitz
random matrix structures, or 0 on kernel random matrices.
Of particular interest is the structure of robust M-estimators
of covariance (or scatter), which has very recently become
better understood in the large dimensional regime and is
the focus of the present work.

The field of robust M-estimation, born with the early works 
of Huber @], roughly consists in improving classical Gaussian 
maximum-likelihood estimators, such as the sample mean or 
SCM, into estimators that (unlike the classical estimators)
are resilient to either the possibly heavy-tailed nature of the
observed data or the presence of outliers in the dataset. Assuming
observation data of known zero mean, robust estimators
of the population covariance matrix, referred to as robust 
M-estimators of scatter, were proposed successively in jcj] for 
data composed of a majority of independent Gaussian samples 
and a few outliers and then in [Q] and IH] for elliptically
distributed or arbitrarily scaled Gaussian data.

But the analysis for each given $N, n$ of the aforementioned
robust estimators of scatter, which often take the form
of solutions of implicit equations, is in general intractable.
In a series of recent works ss (see also 10 Q for 
applications), this limitation was alleviated by considering 
the random matrix regime where both N, n are large and 
commensurable. These works have shown that in this regime 
several classes of robust estimators of scatter (Maronna, Tyler, 
and regularized Tyler) behave similar to simpler and explicit 
random matrix models, which are fully understandable via 
(now standard) random matrix methods. Nonetheless, all these 
works were pursued under the assumption that the input data 
are independent and follow a zero-mean elliptical distribution. 
One of the salient outcomes of these works is that, under
elliptical inputs, the Tyler and regularized Tyler estimators
asymptotically behave similarly to the SCM of the normalized
data,¹ henceforth referred to as the normalized SCM, and
therefore do not provide any apparent gain in robustness versus
simpler sample covariance estimators.

This fact, however, fundamentally disregards the important 
role of robust estimators as arbitrary outlier rejectors. In 
the present work, we shall consider data comprising both 
legitimate data (that are essentially independent Gaussian 
samples) and a certain (a priori unknown) amount of arbitrary 
deterministic outliers. Focusing our attention specifically on the
(larger) class of Maronna’s M-estimators of scatter, similar to
all of the aforementioned works and following the approach
in |@], we will show that in this setting the robust estimator
of scatter behaves similarly, for large $N, n$, to an explicit and
easily understood random matrix. But it will appear, unlike in
§JH, that this random matrix no longer behaves similarly to
the normalized SCM. Our main finding is that, under suitable
conditions, the robust estimator of scatter manages to attenuate
(to some extent) the impact of the deterministic outliers,
which the SCM (or normalized SCM) may not be capable of.
Calling $C_N$ the population covariance matrix of the legitimate
data, $a_i \in \mathbb{C}^N$ the $i$-th outlier, and assuming the number
of outliers is small compared to $n$, it will be demonstrated
that the rejection power of the robust estimator of scatter is
monotonically related to the quadratic form $a_i^\dagger C_N^{-1} a_i$. This
shows that, if $C_N$ is (invertible but) essentially of low rank,
$a_i^\dagger C_N^{-1} a_i$ can take large values and thus $a_i$ is likely to be
suppressed. If $a_i^\dagger C_N^{-1} a_i$ is quite small instead, an inverse effect
of outlier enhancement may appear that needs to be controlled
by an appropriate choice of estimator within Maronna’s class.
We shall show that such an estimator should resemble the
original Huber estimator from |@] and substantially differ from
the Tyler estimator.

¹ This being valid up to second-order fluctuations 0.

In the remainder of the article, we provide a rigorous statement
of our main results. The problem at hand is discussed
in Section II and our main results are introduced in Section III,
all proofs being deferred to the appendices. Special attention
will then be paid to the analytically tractable cases where the
outliers are either few (Section IV) or random i.i.d.
(Section V). Concluding remarks are provided in Section VI.

Notations: The superscript $(\cdot)^\dagger$ stands for Hermitian transpose
in the complex case or transpose in the real case. The
norm $\|\cdot\|$ is the spectral norm for matrices and the Euclidean
norm for vectors. The Dirac measure at point $x$ is denoted
$\delta_x$ and $1_A$ stands for the indicator function with $A$ the
corresponding inclusion event. The imaginary unit is denoted
$\imath = \sqrt{-1}$ and $\Im[\cdot]$ stands for the imaginary part. The set $\mathbb{R}^+$
is defined as $\{x : x > 0\}$ and $\mathbb{C}^+ = \{z \in \mathbb{C},\ \Im[z] > 0\}$. The
support of a distribution function $F$ is denoted by $\mathrm{Supp}(F)$.
The ordered eigenvalues of a Hermitian (or symmetric) matrix
$X$ of size $N \times N$ are denoted $\lambda_1(X) \le \dots \le \lambda_N(X)$. For
$A, B$ Hermitian, $A \succ B$ means that $A - B$ is positive definite.
The notation $\mathrm{diag}(X)$ stands for the diagonal matrix composed
of the diagonal elements of matrix $X$ and $\mathrm{diag}(x)$ the diagonal
matrix composed of the elements of vector $x$ on the diagonal.
The arrow $\xrightarrow{\text{a.s.}}$ designates almost sure convergence and $\Rightarrow$
stands for weak convergence.

II. System Model and Motivation 

For $\varepsilon_n \in \mathbb{R}$ such that $n\varepsilon_n \in \{1, \dots, n\}$, let

$$ Y = [y_1, \dots, y_{(1-\varepsilon_n)n}, a_1, \dots, a_{\varepsilon_n n}] \in \mathbb{C}^{N \times n} $$

where $y_i = C_N^{1/2} x_i \in \mathbb{C}^N$, $i = 1, \dots, (1-\varepsilon_n)n$, are
independent across $i$, $C_N \in \mathbb{C}^{N \times N}$ is deterministic Hermitian
positive definite, and $x_i$ has zero mean, unit variance and
finite $(8+\eta)$-th order moment entries for some $\eta > 0$, while
$a_1, \dots, a_{\varepsilon_n n} \in \mathbb{C}^N$ are arbitrary deterministic vectors.² We
shall further assume that, as $N \to \infty$, $\limsup_N \|C_N\| < \infty$.

The vectors $y_1, \dots, y_{(1-\varepsilon_n)n}$ will be considered the legitimate
data, while $a_1, \dots, a_{\varepsilon_n n}$ are deterministic unknown
outliers. It is important to note at this point that all estimators
of $C_N$ considered in the following are invariant to column
permutations in $Y$ so that we can freely assume the first
columns of $Y$ to be the legitimate data and the last columns
to be the outliers. Note also that we consider here a more
general setting than Gaussian legitimate data as we merely
request the $x_i$’s to have independent normalized entries with
some bounded moment condition.

² As shall be seen in Section V, the vectors $a_i$ can be considered random
as long as they are independent of the $y_i$’s.

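Although $a_1, \dots, a_{\varepsilon_n n}$ are arbitrary, for technical reasons
we shall need the following control.

Assumption 1. $\limsup_n \big\| \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} C_N^{-1/2} a_i a_i^\dagger C_N^{-1/2} \big\| < \infty$.

Note that, if $\limsup_n \varepsilon_n n < \infty$, Assumption 1 reduces to

$$ \limsup_n \max_{1 \le i \le \varepsilon_n n} \frac{1}{N} a_i^\dagger C_N^{-1} a_i < \infty. $$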

If one were aware of the presence and position of outliers
in the dataset, then the natural estimator for $C_N$ (up
to renormalization by $1 - \varepsilon_n$) would read $\frac{1}{n} Y^\circ Y^{\circ\dagger}$ with
$Y^\circ = [y_1, \dots, y_{(1-\varepsilon_n)n}]$; this estimator, which we shall refer
to as the Oracle estimator (hence the “o” superscript), merely
consists in a SCM with discarded outliers. Without knowledge of
the outliers’ presence and positions, the immediate alternative
estimate for $C_N$ is the SCM, which reads here $\frac{1}{n} Y Y^\dagger$. If
one is only interested in estimating any scaled version of
$C_N$, then, to mitigate the negative impact of outliers with
arbitrarily large norm, a simple robust procedure consists in
estimating $C_N$ via the normalized SCM $\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$, where
$Y^{\rm n} = Y \mathrm{diag}(\frac{1}{N} Y^\dagger Y)^{-1/2}$. This matrix has the advantage of
avoiding arbitrarily large biases in the estimation of $C_N$. However,
being only based on a per-data norm control, $\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$
does not take into account the fact that outliers can also be
detected if they significantly differ, not just in norm, from
the majority of the data. The robust estimators of scatter,
introduced by Huber 0 and later studied by Maronna 12],
were precisely designed for this purpose. Our objective here
is to finely understand this outlier identification and mitigation
procedure by means of a large random matrix analysis.
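The three baseline estimators discussed above can be sketched numerically as follows. This is only an illustration under stated assumptions: the function names (`scm`, `normalized_scm`, `oracle_scm`), the toy dimensions, and the choice of a single large-norm outlier placed in the last column are hypothetical and not taken from the article.

```python
import numpy as np

def scm(Y):
    """Sample covariance matrix (1/n) Y Y^H of zero-mean columns."""
    return Y @ Y.conj().T / Y.shape[1]

def normalized_scm(Y):
    """Normalized SCM: rescale each column y so that (1/N)||y||^2 = 1."""
    N = Y.shape[0]
    col_power = np.sum(np.abs(Y) ** 2, axis=0) / N  # diagonal of (1/N) Y^H Y
    return scm(Y / np.sqrt(col_power))

def oracle_scm(Y, n_outliers):
    """SCM built from the legitimate columns only, still divided by n."""
    n = Y.shape[1]
    Yo = Y[:, : n - n_outliers]  # outliers assumed to be the last columns
    return Yo @ Yo.conj().T / n

# Toy data (illustrative values): N = 20, n = 100, one huge-norm outlier.
rng = np.random.default_rng(0)
N, n = 20, 100
Y = rng.standard_normal((N, n))
Y[:, -1] *= 1e3

print(np.linalg.norm(scm(Y), 2))             # dominated by the outlier
print(np.linalg.norm(normalized_scm(Y), 2))  # outlier norm is neutralized
print(np.linalg.norm(oracle_scm(Y, 1), 2))   # outlier discarded
```

The normalized SCM only equalizes column norms, so an outlier of ordinary norm but unusual direction passes through it unchanged, which is the limitation the paragraph above points out.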

To be able to define a robust M-estimator of scatter in
the sense of Maronna under the presence of arbitrary outlier
vectors, a constraint must be set on $\varepsilon_n$ and $N$. In particular,
as $n$ grows large, we shall require that $n(1 - \varepsilon_n)/N$ (and not
only $n/N$) be always beyond one.

Assumption 2 (Growth rate). As $n \to \infty$, $\varepsilon_n \to \varepsilon \in [0, 1)$
and $c_n \triangleq N/n \to c$ with $0 < c < 1 - \varepsilon$.

We then define Maronna’s M-estimator of scatter $\hat{C}_N$ as a
solution, when it exists, to the equation in $Z$

$$ Z = \frac{1}{n} \sum_{i=1}^{(1-\varepsilon_n)n} u\Big(\frac{1}{N} y_i^\dagger Z^{-1} y_i\Big) y_i y_i^\dagger + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} u\Big(\frac{1}{N} a_i^\dagger Z^{-1} a_i\Big) a_i a_i^\dagger \tag{1} $$

where $u : [0, \infty) \to (0, \infty)$ is continuous, non-increasing, and
such that $\phi(x) = x u(x)$ is increasing with $\lim_{x \to \infty} \phi(x) = \phi_\infty$
and $(1 - \varepsilon)^{-1} < \phi_\infty < c^{-1}$. Note that the latter
assumption on $\phi_\infty$ is equivalent to that in |@] with a slight
modification accounting for the presence of outliers.

A standard choice for the function $u$ is $u = u_{\rm S}$, where, for
some $t > 0$,

$$ u_{\rm S}(x) = \frac{1 + t}{t + x} \tag{2} $$

which, for an appropriate $t$, turns $\hat{C}_N$ into the maximum-likelihood
estimator of $C_N$ when the columns of $Y$ are independent
multivariate Student vectors (hence the subscript
“S”). As $t \to 0$, $\hat{C}_N$ converges to one version of the so-called
Tyler estimator fl], as shown in [15].³ We shall however
restrict our study here to Maronna’s class of estimators. Of
particular interest in the present work is another function $u$,
which we shall (somewhat abusively⁴) refer to as Huber’s
estimator function $u_{\rm H}$, defined, for some $t > 0$, as

$$ u_{\rm H}(x) = \min\Big\{ 1, \frac{1 + t}{t + x} \Big\}. \tag{3} $$

This function has the particularity of being constant for all
$x < 1$, which will be later seen as an important property.
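To make the implicit definition more concrete, the sketch below computes a solution by plain fixed-point iteration. Several points are assumptions rather than statements from the article: the estimator is taken in the standard Maronna form $Z = \frac{1}{n} \sum_i u(\frac{1}{N} y_i^\dagger Z^{-1} y_i) y_i y_i^\dagger$ over all columns of $Y$, Huber's function is read as $\min\{1, (1+t)/(t+x)\}$ (the reading consistent with $u_{\rm H}$ being constant for $x < 1$), and the iteration scheme, stopping rule, and names (`maronna`, `u_student`, `u_huber`) are illustrative choices.

```python
import numpy as np

def u_student(x, t=0.1):
    """Weight function u_S(x) = (1 + t) / (t + x)."""
    return (1.0 + t) / (t + x)

def u_huber(x, t=0.1):
    """Weight function u_H(x) = min{1, (1 + t) / (t + x)}, constant for x < 1."""
    return np.minimum(1.0, (1.0 + t) / (t + x))

def maronna(Y, u, n_iter=1000, tol=1e-9):
    """Fixed-point iteration Z <- (1/n) sum_i u(q_i) y_i y_i^H,
    with q_i = (1/N) y_i^H Z^{-1} y_i, started from the identity."""
    N, n = Y.shape
    Z = np.eye(N, dtype=Y.dtype)
    for _ in range(n_iter):
        # quadratic forms q_i for all columns at once
        q = np.real(np.sum(Y.conj() * np.linalg.solve(Z, Y), axis=0)) / N
        Z_new = (Y * u(q)) @ Y.conj().T / n
        if np.linalg.norm(Z_new - Z, 2) <= tol * np.linalg.norm(Z, 2):
            return Z_new
        Z = Z_new
    return Z
```

A call such as `maronna(Y, u_huber)` on an `N x n` data matrix returns the iterate at which successive updates stop changing. The iteration can be slow along the scale direction when `t` is small, hence the generous default iteration cap.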


III. Main Result 

From the problem setting, Assumption 2, and t3
Thm. 2.3], it is easily seen that, with probability one, the
solution of (1) is unique for all large $n$ and thus $\hat{C}_N$ is
unequivocally defined. In the same spirit as in HOI (and
with similar notations), our first objective is to find an explicit
tight approximation of the implicitly defined $\hat{C}_N$ in the regime
where $N, n \to \infty$ as per Assumption 2. Our main result
unfolds as follows.

Theorem 1 (Asymptotic Behavior). Let Assumptions 1 and 2 hold
and let $\hat{C}_N$ be the solution to (1) (unique for all large $n$, with
probability one). Then, as $n \to \infty$,

$$ \| \hat{C}_N - \hat{S}_N \| \xrightarrow{\text{a.s.}} 0 $$

where

$$ \hat{S}_N = v(\gamma_n) \frac{1}{n} \sum_{i=1}^{(1-\varepsilon_n)n} y_i y_i^\dagger + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} v(\alpha_{i,n}) a_i a_i^\dagger $$

with $v(x) = u(g^{-1}(x))$, $g(x) = x/(1 - c\phi(x))$, and
$(\gamma_n, \alpha_{1,n}, \dots, \alpha_{\varepsilon_n n, n})$ the solution to

$$ \gamma_n = \frac{1}{N} \operatorname{tr} C_N \Big( \frac{(1-\varepsilon) v(\gamma_n)}{1 + c v(\gamma_n) \gamma_n} C_N + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} v(\alpha_{i,n}) a_i a_i^\dagger \Big)^{-1} $$

$$ \alpha_{i,n} = \frac{1}{N} a_i^\dagger \Big( \frac{(1-\varepsilon) v(\gamma_n)}{1 + c v(\gamma_n) \gamma_n} C_N + \frac{1}{n} \sum_{j \neq i} v(\alpha_{j,n}) a_j a_j^\dagger \Big)^{-1} a_i \tag{4} $$

for $i = 1, \dots, \varepsilon_n n$. In particular, from Thm. 4.3.7],

$$ \max_{1 \le i \le N} \big| \lambda_i(\hat{C}_N) - \lambda_i(\hat{S}_N) \big| \xrightarrow{\text{a.s.}} 0. $$

Remark 1 (Function $v$). The function $v$ defined in Theorem 1
was already introduced in Ml and uses, through $g$, the
assumption that $\phi(x) < c^{-1}$. It has essentially the same
general properties as $u$ in that it is continuous, non-increasing
and such that $\psi(x) = x v(x)$ is increasing and bounded with
$\lim_{x \to \infty} \psi(x) = \psi_\infty = \phi_\infty / (1 - c\phi_\infty)$.

Remark 2 (Relation to previous results). Taking $\varepsilon_n = 0$,
Theorem 1 reduces to the result obtained in Mi and Q,
i.e., $\hat{S}_N = v(\gamma_n) \frac{1}{n} Y Y^\dagger$. In this case, (4) reduces to

$$ \gamma_n = \frac{1 + c v(\gamma_n) \gamma_n}{v(\gamma_n)} $$

which, after basic algebra, entails $\gamma_n = \phi^{-1}(1)/(1 - c)$ and
$v(\gamma_n) = 1/\phi^{-1}(1)$.

Theorem 1 allows us to transfer many properties of the
implicit matrix $\hat{C}_N$ into the more tractable matrix $\hat{S}_N$, the
random matrix structure of which is well known and has been
studied as early as in 1 11911 . The structure of $\hat{S}_N$ is particularly
interesting as it mostly consists of two terms: the sum of outer
products of the legitimate data scaled by a constant factor
$v(\gamma_n)$ along with a per-sample weighted sum of the outer
products of the outlying data. Therefore, as one would expect,
$\hat{C}_N$ sets a specific emphasis (either small or large) on each
outlying sample while maintaining all legitimate data under
constant weight. We expect here that, as opposed to the SCM
that provides no control on the data or to the normalized SCM
that merely normalizes the outliers, $\hat{C}_N$ will appropriately
ensure a reduction of the outlier impact by letting $v(\alpha_{i,n})$
be quite small compared with $v(\gamma_n)$, especially if $\varepsilon_n$ is small.

An immediate corollary of Theorem 1 concerns the large
$N$ eigenvalue distribution of $\hat{C}_N$ and reads as follows.

Corollary 1 (Spectral Distribution). Define the empirical
spectral distribution $F^{\hat{C}_N}(x) = \frac{1}{N} \sum_{i=1}^{N} 1_{\{\lambda_i(\hat{C}_N) \le x\}}$ for
$x \in \mathbb{R}$. Then, under the setting of Theorem 1,

$$ F^{\hat{C}_N}(x) - F_N(x) \Rightarrow 0 $$

almost surely as $n \to \infty$, where $F_N(x)$ is a real distribution
function with density defined via its Stieltjes transform $m_N(z)$
(i.e.,⁵ $m_N(z) = \int (t - z)^{-1} dF_N(t)$) given for all $z \in \mathbb{C}^+$ by

$$ m_N(z) = \frac{1}{N} \operatorname{tr} \Big( \frac{(1-\varepsilon) v(\gamma_n)}{1 + e_N(z)} C_N + A_N - z I_N \Big)^{-1} $$

with $A_N = \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} v(\alpha_{i,n}) a_i a_i^\dagger$ and $e_N(z)$ the unique
solution in $\mathbb{C}^+$ of the equation

$$ e_N(z) = v(\gamma_n) \frac{1}{n} \operatorname{tr} C_N \Big( \frac{(1-\varepsilon) v(\gamma_n)}{1 + e_N(z)} C_N + A_N - z I_N \Big)^{-1}. $$


In the appendix, it is importantly shown that
$\limsup_N \|\hat{C}_N\| < \infty$ a.s. (as a result of $\limsup_N \|\hat{S}_N\| < \infty$
a.s.). This implies that $F^{\hat{C}_N}$ and $F_N$ have compact
supports and are fully determined by their respective
moments $M^{\hat{C}_N}_{k} = \int t^k dF^{\hat{C}_N}(t)$ and $M_{N,k} = \int t^k dF_N(t)$,
$k = 1, 2, \dots$, which satisfy $M^{\hat{C}_N}_{k} - M_{N,k} \xrightarrow{\text{a.s.}} 0$ (by the
dominated convergence theorem). While $F_N$ is defined via
its deterministic but implicit Stieltjes transform, the $M_{N,k}$
can be retrieved explicitly using successive derivatives of the
moment generating formula (for $|z| < 1/\sup(\mathrm{Supp}(F_N))$)

$$ m_N(1/z) = -\sum_{k=0}^{\infty} z^{k+1} M_{N,k}. $$

³ As opposed to Maronna’s class of estimators, Tyler estimator is only
defined up to a constant factor; thus it estimates $C_N$ up to a scale parameter.

⁴ Huber’s original estimator takes the form $u(x) = \min\{\alpha, \beta/x\}$ for some
$\alpha, \beta$, hence with additional parameters and with $t = 0$. However, uniqueness
of $\hat{C}_N$ is not guaranteed for $t = 0$ and, in the random matrix limit, $\alpha =
\beta = 1$ is a particularly appealing choice.

⁵ Recall that any distribution function $F$ is uniquely defined by its Stieltjes
transform $m(z)$ by the fact that, for all continuity points $a, b$ of $F$,
$F(b) - F(a) = \lim_{y \downarrow 0} \frac{1}{\pi} \int_a^b \Im[m(t + \imath y)] dt$.

Precisely, we obtain here the following result. 

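Corollary 2 (Moments). For $F_N$ defined in Corollary 1,
letting $M_{N,p} = \int t^p dF_N(t)$, $p = 1, 2, \dots$,

$$ M_{N,p} = \frac{(-1)^p}{p!} \frac{1}{N} \operatorname{tr} T_p $$

where $T_p$ is obtained from the following recursive formulas

$$ T_{p+1} = -\sum_{i=0}^{p} \binom{p}{i} T_{p-i} A_N T_i + \sum_{i=0}^{p} \sum_{j=0}^{i} \binom{p}{i} \binom{i}{j} T_{p-i} Q_{i-j+1} T_j $$

$$ Q_{p+1} = (p+1) f_p (1-\varepsilon) v(\gamma_n) C_N $$

$$ f_{p+1} = \sum_{i=0}^{p} \sum_{j=0}^{i} \binom{p}{i} \binom{i}{j} (p - i + 1) f_j f_{i-j} \beta_{p-i} $$

$$ \beta_{p+1} = v(\gamma_n) \frac{1}{n} \operatorname{tr} C_N T_{p+1}, $$

with initial values $T_0 = I_N$, $f_0 = -1$, $\beta_0 = v(\gamma_n) \frac{1}{n} \operatorname{tr} C_N$.
In particular,

$$ M_{N,1} = \frac{1}{N} \operatorname{tr} \big[ A_N + (1-\varepsilon) v(\gamma_n) C_N \big] $$

$$ M_{N,2} = \frac{1}{N} \operatorname{tr} \big[ A_N^2 + 2(1-\varepsilon) v(\gamma_n) C_N A_N + (1-\varepsilon)^2 v^2(\gamma_n) C_N^2 \big] + \Big( \frac{1}{n} \operatorname{tr} C_N \Big) \frac{1}{N} \operatorname{tr} \big[ (1-\varepsilon) v^2(\gamma_n) C_N \big]. $$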

Albeit having characterized the random matrix $\hat{S}_N$, which
approximates the behavior of $\hat{C}_N$ for large $N, n$, it is quite
challenging to gain a good intuitive understanding of the
weight structure as the expression (4) relating $\gamma_n$ to the $\alpha_{i,n}$’s
is still implicit (while being deterministic). To get more insight
on the properties of $\hat{C}_N$, we shall successively consider two
specific scenarios that simplify the system (4).

IV. Finitely Many Outliers Scenario 

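Let us first assume that $\varepsilon_n n = K$ is maintained constant
as $n \to \infty$ (thus $\varepsilon = 0$). Recall that, in this scenario,
Assumption 1 can be replaced by the sufficient condition
$\limsup_N \max_{1 \le i \le \varepsilon_n n} \frac{1}{N} a_i^\dagger C_N^{-1} a_i < \infty$. In the appendix, it
is shown that $\gamma_n$ cannot grow unbounded with $n$. As such, by
a rank-one perturbation argument iterated $K$ times, see e.g.,
E Lemma 2.6], we find that

$$ \gamma_n = \frac{1 + c v(\gamma_n) \gamma_n}{v(\gamma_n)} + O(1/N) $$

which ensures by Remark 2 that

$$ \gamma_n = \frac{\phi^{-1}(1)}{1 - c} + O(1/N). $$

We shall denote next $\gamma = \frac{\phi^{-1}(1)}{1 - c}$ (and thus $v(\gamma) = 1/\phi^{-1}(1)$).
Then we obtain that

$$ \| \hat{C}_N - \hat{S}'_N \| \xrightarrow{\text{a.s.}} 0 $$

with

$$ \hat{S}'_N = v(\gamma) \frac{1}{n} \sum_{i=1}^{n-K} y_i y_i^\dagger + \frac{1}{n} \sum_{i=1}^{K} v(\alpha'_{i,n}) a_i a_i^\dagger $$

where $\alpha'_{i,n}$ are the unique positive solutions to

$$ \alpha'_{i,n} = \frac{1}{N} a_i^\dagger \Big( \frac{1}{\gamma} C_N + \frac{1}{n} \sum_{j \neq i} v(\alpha'_{j,n}) a_j a_j^\dagger \Big)^{-1} a_i. $$

As such, when the number $K$ of outliers is fixed, the common
weight $v(\gamma_n)$ becomes independent of the vectors $a_i$ (even if
they are of arbitrarily large norm) while the individual weights
$v(\alpha_{i,n})$ eventually solve a system of $K$ equations involving
the $a_i$’s and $C_N$.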

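A more specific case lies in the scenario where $a_1 = \dots = a_K$.
There, $\alpha'_{1,n} = \dots = \alpha'_{K,n}$ and the $K$ equations above
reduce to a single one reading

$$ \alpha'_{1,n} = \frac{1}{N} a_1^\dagger \Big( \frac{1}{\gamma} C_N + \frac{K-1}{n} v(\alpha'_{1,n}) a_1 a_1^\dagger \Big)^{-1} a_1 $$

which, using $a^\dagger (A + t a a^\dagger)^{-1} = a^\dagger A^{-1} / (1 + t a^\dagger A^{-1} a)$ for
invertible $A$, simplifies as

$$ \alpha'_{1,n} = \frac{\gamma \frac{1}{N} a_1^\dagger C_N^{-1} a_1}{1 + c_n (K-1) v(\alpha'_{1,n}) \gamma \frac{1}{N} a_1^\dagger C_N^{-1} a_1} $$

or equivalently

$$ \frac{\alpha'_{1,n}}{1 - c_n (K-1) \psi(\alpha'_{1,n})} = \gamma \frac{1}{N} a_1^\dagger C_N^{-1} a_1. $$

Since the right-hand side is positive, so should be the left-hand
side, which may then be seen as an increasing function
of $\alpha'_{1,n}$. Thus, since $\gamma$ depends neither on $C_N$ nor $a_1$, it follows
that $\alpha'_{1,n}$ is an increasing function of $\frac{1}{N} a_1^\dagger C_N^{-1} a_1$. Moreover,
$\alpha'_{1,n} < \psi^{-1}(1/(c_n(K-1)))$ and thus converges to zero as
$K$ grows large. When $K = 1$, and thus the outlier is now
isolated, this reduces to

$$ \alpha'_{1,n} = \gamma \frac{1}{N} a_1^\dagger C_N^{-1} a_1. $$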
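The scalar equation for $K$ identical outliers can be explored numerically with the sketch below. It is an illustration under assumptions: $c_n$ is replaced by its limit $c$, Huber's function is read as $\min\{1, (1+t)/(t+x)\}$, the equation is solved in the equivalent form $\alpha = \gamma q (1 - c(K-1)\psi(\alpha))$ with $q = \frac{1}{N} a_1^\dagger C_N^{-1} a_1$, and the helper names (`bisect`, `make_v`, `alpha_identical`) and the bisection solver are hypothetical choices, not the authors' method.

```python
def bisect(f, lo, hi, iters=200):
    """Root of an increasing function f on [lo, hi] with f(lo) <= 0 <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def make_v(u, c):
    """Return v(y) = u(g^{-1}(y)) with g(x) = x / (1 - c x u(x))."""
    g = lambda x: x / (1.0 - c * x * u(x))
    # g(x) >= x, so the solution of g(x) = y lies in [0, y]
    return lambda y: u(bisect(lambda x: g(x) - y, 0.0, y))

def alpha_identical(q, K, u, c):
    """Solve alpha = gamma q (1 - c (K-1) psi(alpha)), psi(x) = x v(x),
    where q stands for (1/N) a_1^H C_N^{-1} a_1."""
    v = make_v(u, c)
    phi_inv_1 = bisect(lambda x: x * u(x) - 1.0, 0.0, 1e6)  # phi^{-1}(1)
    gamma = phi_inv_1 / (1.0 - c)
    target = gamma * q
    h = lambda a: a - target * (1.0 - c * (K - 1) * a * v(a))  # increasing in a
    return bisect(h, 0.0, target), v

# Huber-type weight with t = 0.1 (assumed min form)
u_H = lambda x, t=0.1: min(1.0, (1.0 + t) / (t + x))

for K in (1, 2, 5):
    a, v = alpha_identical(14.5, K, u_H, 0.2)
    print(K, a, v(a))  # alpha' shrinks and the weight v(alpha') grows with K
```

For $K = 1$ the solver returns $\gamma q$, the isolated-outlier value; for larger $K$ the solution drops below $\psi^{-1}(1/(c(K-1)))$, in line with the bound stated above.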

This short calculus leads to two important remarks. First,
for $K = 1$, $\hat{C}_N$ asymptotically allocates a weight $v(\gamma)$ to
the legitimate data and a weight $v(\gamma \frac{1}{N} a_1^\dagger C_N^{-1} a_1)$ to the
outlier. As a consequence, by the non-increasing property of
$v$, the effect of the outlier will be (for most choices of the $v$
function) attenuated if $\frac{1}{N} a_1^\dagger C_N^{-1} a_1 > 1$ but will be increased
if $\frac{1}{N} a_1^\dagger C_N^{-1} a_1 < 1$. As such, the robust estimator of scatter
will tend to mitigate the effect of outliers $a_1$ having either
large norm or, more interestingly, having strong alignment to
the weakest eigenmodes of $C_N$. In particular, note that when
$C_N = I_N$, $\hat{C}_N$ will mostly control outliers upon their norms
$\frac{1}{N} \|a_1\|^2$, which is essentially what the normalized SCM

$$ \frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger} = \frac{1}{n} \sum_{i=1}^{(1-\varepsilon_n)n} \frac{y_i y_i^\dagger}{\frac{1}{N} \|y_i\|^2} + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} \frac{a_i a_i^\dagger}{\frac{1}{N} \|a_i\|^2} \tag{5} $$

would do, and thus there is no gain in using robust estimators
here. However, if $C_N$ has large dimensional weak eigenspaces
(i.e., close to singular with most eigenvalues near zero),
$\frac{1}{N} a_1^\dagger C_N^{-1} a_1$ may be quite large, and thus $a_1$ may be strongly
attenuated. But if $a_1$ aligns to the strong eigenmodes of $C_N$,
the impact of $a_1$ may be enhanced rather than reduced. To
avoid this effect, undesirable in most cases, it is crucial to
appropriately choose the $u$ function. Specifically, the function
$v$ should be taken constant for all $x < \gamma$, or equivalently, $u(x)$
should be taken constant for $x < \phi^{-1}(1)$. A natural choice is
the Huber estimator $u = u_{\rm H}$ introduced in (3).

The second remark is a slightly more surprising outcome.
Indeed, despite $n$ being potentially extremely large, the presence
of (already few) $K > 1$ identical outliers drives $\hat{S}'_N$ (and
thus $\hat{C}_N$) to allocate large weights $v(\alpha'_{1,n})$ (since $\alpha'_{1,n}$ is small)
to these outliers, therefore seemingly contradicting the very
purpose of the robust estimator. This seems to indicate that $\hat{C}_N$
has the propensity to put forward both large quantities of data
with similar distribution as well as rather small quantities of
vectors with strong pairwise alignment, while more naturally
rejecting isolated outliers.

In terms of large dimensional spectral distribution and
moments, the scenario of finitely many outliers is asymptotically
equivalent to the outlier-free scenario. This can be
observed from a rank-one perturbation argument along with
$\varepsilon_n \to 0$ applied to Corollaries 1 and 2. A similar reasoning would
hold for the normalized SCM. However, the matrices $\hat{C}_N$
and $\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$ themselves experience a (maximum) rank-$K$
perturbation which can severely compromise the estimation
of $C_N$, along the previous argumentation lines.

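Figure 1 displays an artificially generated scenario where a
single outlier $a_1$ of norm $\frac{1}{N} \|a_1\|^2 = 1$ produces a large value
for $\frac{1}{N} a_1^\dagger C_N^{-1} a_1$ (= 14.50), thus entailing a strong attenuation
by $\hat{C}_N$. The terms $a_1$ and $C_N$ were made such that the SCM
and normalized SCM have the same asymptotic eigenvalues
and produce an isolated eigenvalue (around .25). The spectra
of the latter are compared against those of $\hat{C}_N$ and the oracle
estimator. It is seen that the isolated eigenvalue, which is
naturally not present in the spectrum of the oracle estimator,
is also not present in the spectrum of $\hat{C}_N$, indicating that $\hat{C}_N$
has significantly reduced its impact on the spectrum.

Fig. 1. Eigenvalues of the SCM ($\frac{1}{n} Y Y^\dagger$), normalized SCM ($\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$),
$\hat{C}_N$ for $u = u_{\rm S}$ with $t = .1$, and the oracle estimator ($\frac{1}{n} Y^\circ Y^{\circ\dagger}$); $N = 100$,
$c = .2$, $\varepsilon_n n = 1$, $a_1$ made of a first block in $\mathbb{R}^{10}$ with entries $\sqrt{10}$ and a
second block in $\mathbb{R}^{90}$ with entries $0$, such that $\|a_1\|^2 = N$; $y_i = C_N^{1/2} x_i$ with $x_i$ standard Gaussian
and $C_N = (16/14.50)\, \mathrm{diag}(c_1, c_2)$, $c_1 \in \mathbb{R}^{10}$, $c_2 \in \mathbb{R}^{90}$, with $c_{1i} = 1/16$,
$c_{2i} = 1$, such that $\operatorname{tr} C_N = N$. Ellipse around the outlier artifact.

Another interesting case study that shall provide further
insight on $\hat{C}_N$ is that where the $a_i$’s (possibly numerous) are
independently extracted from a different distribution to that of
the $y_i$’s. This is pursued in the subsequent section.

V. Random Outliers Scenario

Assuming $a_1, \dots, a_{\varepsilon_n n}$ to be independent with zero mean
and covariance $D_N \neq C_N$ provides a rather immediate
corollary of Theorem 1, given below. In the results to come, to
differentiate between the conditions of Theorem 1 and those
of Corollary 3 we shall use the superscript “R” standing for
“random outliers scenario”.

Corollary 3 (Random Outliers). Let Assumption 2 hold with
$\varepsilon > 0$ and let $a_1, \dots, a_{\varepsilon_n n}$ be random independent of the
$y_i$’s with $a_i = D_N^{1/2} x'_i$, where $D_N \in \mathbb{C}^{N \times N}$ is deterministic
Hermitian positive definite and $x'_1, \dots, x'_{\varepsilon_n n}$ are independent
random vectors with i.i.d. zero mean, unit variance, and finite
$(8+\eta)$-th order moment entries, for some $\eta > 0$. Let us further
assume that $\limsup_N \|D_N C_N^{-1}\| < \infty$. Then, as $n \to \infty$,

$$ \| \hat{C}_N - \hat{S}^{\rm R}_N \| \xrightarrow{\text{a.s.}} 0 $$

where

$$ \hat{S}^{\rm R}_N = v(\gamma^{\rm R}_n) \frac{1}{n} \sum_{i=1}^{(1-\varepsilon_n)n} y_i y_i^\dagger + v(\alpha^{\rm R}_n) \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} a_i a_i^\dagger $$

with $\gamma^{\rm R}_n$ and $\alpha^{\rm R}_n$ the unique positive solutions to

$$ \gamma^{\rm R}_n = \frac{1}{N} \operatorname{tr} C_N \Big( \frac{(1-\varepsilon) v(\gamma^{\rm R}_n)}{1 + c v(\gamma^{\rm R}_n) \gamma^{\rm R}_n} C_N + \frac{\varepsilon v(\alpha^{\rm R}_n)}{1 + c v(\alpha^{\rm R}_n) \alpha^{\rm R}_n} D_N \Big)^{-1} $$

$$ \alpha^{\rm R}_n = \frac{1}{N} \operatorname{tr} D_N \Big( \frac{(1-\varepsilon) v(\gamma^{\rm R}_n)}{1 + c v(\gamma^{\rm R}_n) \gamma^{\rm R}_n} C_N + \frac{\varepsilon v(\alpha^{\rm R}_n)}{1 + c v(\alpha^{\rm R}_n) \alpha^{\rm R}_n} D_N \Big)^{-1}. $$

In particular, for $F^{\hat{C}_N}(x)$ as defined in Corollary 1,

$$ F^{\hat{C}_N}(x) - F^{\rm R}_N(x) \Rightarrow 0 $$

almost surely as $n \to \infty$, where $F^{\rm R}_N(x)$ is a real distribution
function with density, defined via its Stieltjes transform

$$ m^{\rm R}_N(z) = \frac{1}{N} \operatorname{tr} (E_N - z I_N)^{-1}, \qquad E_N = \frac{(1-\varepsilon) v(\gamma^{\rm R}_n)}{1 + e_{N,1}(z)} C_N + \frac{\varepsilon v(\alpha^{\rm R}_n)}{1 + e_{N,2}(z)} D_N $$

for $z \in \mathbb{C}^+$ and $(e_{N,1}(z), e_{N,2}(z))$ the unique solution in
$(\mathbb{C}^+)^2$ to

$$ e_{N,1}(z) = v(\gamma^{\rm R}_n) \frac{1}{n} \operatorname{tr} C_N (E_N - z I_N)^{-1} $$

$$ e_{N,2}(z) = v(\alpha^{\rm R}_n) \frac{1}{n} \operatorname{tr} D_N (E_N - z I_N)^{-1}. $$

[…] .05, and $u(x) = (1 + t)/(t + x)$ where $t = .1$.

Fig. 3. Mean and standard deviation (error bars) of the normalized spectral norm distance between $\hat{C}_N$ and $\hat{S}^{\rm R}_N$
for $c_n = 0.25$, $[C_N]_{ij} = $ …, $[D_N]_{ij} = $ …, $\varepsilon_n = .05$, and
$u(x) = (1 + t)/(t + x)$, with $t = .1$.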

Figure 2 shows the density of the distribution $\mathrm{E}[F^{\hat{C}_N}]$,
obtained from Monte-Carlo averaging, versus $F^{\rm R}_N$ for different
values of $N, n$. It is observed that, as soon as $N$ is of the order
of several tens, the asymptotic approximation holds tightly.
The (normalized) distance in spectral norm between $\hat{C}_N$ and
$\hat{S}^{\rm R}_N$ is numerically evaluated in Figure 3 for various values
of $N$. As suggested in the second order analysis of Eh,
$\|\hat{C}_N - \hat{S}_N\|$ (or $\|\hat{C}_N - \hat{S}^{\rm R}_N\|$ here) is likely to decay at the
rate $1/\sqrt{N}$, which is somewhat confirmed by observing that
between $N = 20$ and $N = 80$, the approximation error decays
by a factor of two (precisely, 0.042 versus 0.019).

In the random outliers scenario, $\hat{C}_N$ is asymptotically
equivalent to the weighted sum of two partial sample covariance
matrices, one corresponding to the legitimate data and
the other to the outlying data. In the defining equations for $\gamma^{\rm R}_n$
and $\alpha^{\rm R}_n$ an interesting symmetrical interplay arises between the
weights applied to the legitimate and the outlying data, which
are only differentiated by $\varepsilon$. In particular, if $\varepsilon > 1/2$, the $a_i$’s
will be considered legitimate (being in majority) and the $y_i$’s
become outliers.

Despite the symmetrical form of the equations defining $\gamma^{\rm R}_n$
and $\alpha^{\rm R}_n$, it remains difficult to extract general insight on these
quantities. Thus, again, it is interesting to study the regime
where $\varepsilon \to 0$. In this case, $\gamma^{\rm R}_n \to \gamma = \phi^{-1}(1)/(1 - c)$, and,
setting $\varepsilon = 0$ in the equation defining $\alpha^{\rm R}_n$,

$$ \alpha^{\rm R}_n - \gamma \frac{1}{N} \operatorname{tr} D_N C_N^{-1} \to 0. $$

As such, the factor dictating the outlier mitigation strength
of $\hat{C}_N$ is now $\frac{1}{N} \operatorname{tr} D_N C_N^{-1}$. Similar to before, when larger
than one, the impact of the outliers will be reduced but these
might be enhanced when smaller than one. Interestingly, if
$\frac{1}{N} \operatorname{tr} D_N = \frac{1}{N} \operatorname{tr} C_N = 1$ (say), both legitimate and outlier
samples have similar norm for all large $n$. As such, under this
scenario, the SCM or its normalized version $\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$
behave asymptotically equivalently, neither of which being
capable of differentiating between legitimate and outlier data.
On the contrary, $\hat{C}_N$ is capable of reducing the impact of the
outliers as long as $\frac{1}{N} \operatorname{tr} D_N C_N^{-1} > 1$. Note here again that $C_N$
must be sufficiently distinct from $I_N$, which would otherwise
entail $\frac{1}{N} \operatorname{tr} D_N C_N^{-1} \simeq 1$ and thus $\hat{C}_N$ would be indifferent to
outliers. Also, similar to previously, $u$ must be well chosen to
avoid enhancing the outlier effect if $\frac{1}{N} \operatorname{tr} D_N C_N^{-1} < 1$ (so in
particular it is advised that $u$ be similar to $u_{\rm H}$).
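As a numerical illustration of the small-$\varepsilon$ regime, the sketch below evaluates the two limiting weights for the setting used in the figures of this section ($[C_N]_{ij} = .9^{|i-j|}$, $D_N = I_N$, $N = 100$, $c = .2$, $t = .1$). It rests on assumptions: the limit $\alpha \approx \gamma \frac{1}{N} \operatorname{tr} D_N C_N^{-1}$ obtained by setting $\varepsilon = 0$ in the defining equations, Huber's function read as $\min\{1, (1+t)/(t+x)\}$, and a hypothetical bisection helper for inverting $g$ and $\phi$.

```python
import numpy as np

def bisect(f, lo, hi, iters=200):
    """Root of an increasing function f on [lo, hi] with f(lo) <= 0 <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

t, c, N, rho = 0.1, 0.2, 100, 0.9

u_H = lambda x: min(1.0, (1.0 + t) / (t + x))   # assumed min form of (3)
phi = lambda x: x * u_H(x)
g = lambda x: x / (1.0 - c * phi(x))
v_H = lambda y: u_H(bisect(lambda x: g(x) - y, 0.0, y))  # v = u o g^{-1}

idx = np.arange(N)
C = rho ** np.abs(idx[:, None] - idx[None, :])  # [C_N]_ij = .9^{|i-j|}
D = np.eye(N)                                   # D_N = I_N

gamma = bisect(lambda x: phi(x) - 1.0, 0.0, 1e6) / (1.0 - c)  # phi^{-1}(1)/(1-c)
ratio = np.trace(D @ np.linalg.inv(C)) / N      # (1/N) tr D_N C_N^{-1}
alpha = gamma * ratio                           # small-epsilon limit of alpha^R

# Weights on legitimate data and on outliers; compare with the limiting
# values quoted in the text for epsilon_n -> 0.
print(v_H(gamma), v_H(alpha))
```

Here $\frac{1}{N} \operatorname{tr} D_N C_N^{-1}$ is well above one, so the outlier weight is much smaller than the legitimate-data weight.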

Figure 4 depicts the previous observations in terms of the
deterministic equivalent spectral distributions: $F^{\rm R}_N$ of $\hat{C}_N$,
$F^{\rm SCM}_N$ of $\frac{1}{n} Y Y^\dagger$ (or $F^{\rm nSCM}_N$ of $\frac{1}{n} Y^{\rm n} Y^{{\rm n}\dagger}$, which satisfies
$F^{\rm nSCM}_N = F^{\rm SCM}_N$ here), and $F^{\rm oracle}_N$ of the outlier-free oracle
estimator $\frac{1}{n} Y^\circ Y^{\circ\dagger}$; we take here $C_N$ and $D_N$ to ensure
$\frac{1}{N} \operatorname{tr} D_N C_N^{-1}$ large and $\varepsilon$ is taken small. The sought-for
distribution that would optimally discard all outliers is the
oracle distribution and, thus, highly robust estimators are
expected to have a similar distribution. Figure 4 confirms that
this is indeed the case of $\hat{C}_N$, which shows a close tail behavior
but is slightly mismatched in the main distribution lobe. On
the contrary, the SCM (normalized or not) shows a strong
decay in the main lobe and a non-matching tail. The associated
theoretical values of $v(\gamma^{\rm R}_n)$ and $v(\alpha^{\rm R}_n)$ for $\varepsilon_n = .05$ are here
$v_{\rm H}(\gamma^{\rm R}_n) \simeq 1.00$, $v_{\rm H}(\alpha^{\rm R}_n) \simeq .1219$, while in the limit $\varepsilon_n \to 0$,
these values become $v_{\rm H}(\gamma^{\rm R}_n) \to 1$ and $v_{\rm H}(\alpha^{\rm R}_n) \to .1179$.

Fig. 4. Density of the approximate (deterministic) spectral distributions for
the outlier-free oracle ($F^{\rm oracle}_N$), the SCM or normalized SCM ($F^{\rm SCM}_N =
F^{\rm nSCM}_N$), and $\hat{C}_N$ ($F^{\rm R}_N$), with $u = u_{\rm H}$ with parameter $t = .1$, $[C_N]_{ij} =
.9^{|i-j|}$, $D_N = I_N$, $N = 100$, $c = .2$, and $\varepsilon = .05$.

As it appears from Figure 4 that the tail of the various
estimator distributions may be strongly affected by a weak
outlier control, it is interesting to investigate the impact on
their moments. For this, we introduce the following application
of Corollary 2 to the random outlier setting.


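Corollary 4 (Moments in Random Case). Under the setting
of Corollary 3, letting $M^{\rm R}_{N,p} = \int t^p dF^{\rm R}_N(t)$, we have

$$ M^{\rm R}_{N,p} = \frac{(-1)^p}{p!} \frac{1}{N} \operatorname{tr} T^{\rm R}_p $$

where $T^{\rm R}_p$ is obtained recursively as

$$ T^{\rm R}_{p+1} = \sum_{i=0}^{p} \sum_{j=0}^{i} \binom{p}{i} \binom{i}{j} T^{\rm R}_{p-i} Q^{\rm R}_{i-j+1} T^{\rm R}_j $$

$$ Q^{\rm R}_{p+1} = (p + 1) \big[ (1 - \varepsilon) f_{1,p} R_1 + \varepsilon f_{2,p} R_2 \big] $$

$$ f_{k,p+1} = \sum_{i=0}^{p} \sum_{j=0}^{i} \binom{p}{i} \binom{i}{j} (p - i + 1) f_{k,j} f_{k,i-j} \beta_{k,p-i} $$

$$ \beta_{k,p+1} = \frac{1}{n} \operatorname{tr} R_k T^{\rm R}_{p+1}, $$

with initial values $T^{\rm R}_0 = I_N$, $f_{k,0} = -1$, $\beta_{k,0} = \frac{1}{n} \operatorname{tr} R_k$, and
with $R_1 = v(\gamma^{\rm R}_n) C_N$, $R_2 = v(\alpha^{\rm R}_n) D_N$. In particular,

$$ M^{\rm R}_{N,1} = \frac{1}{N} \operatorname{tr} \big[ \varepsilon v(\alpha^{\rm R}_n) D_N + (1 - \varepsilon) v(\gamma^{\rm R}_n) C_N \big] $$

$$ M^{\rm R}_{N,2} = \frac{1}{N} \operatorname{tr} \big[ \big( \varepsilon v(\alpha^{\rm R}_n) D_N + (1 - \varepsilon) v(\gamma^{\rm R}_n) C_N \big)^2 \big] + \varepsilon v^2(\alpha^{\rm R}_n) \Big( \frac{1}{N} \operatorname{tr} D_N \Big) \Big( \frac{1}{n} \operatorname{tr} D_N \Big) + (1 - \varepsilon) v^2(\gamma^{\rm R}_n) \Big( \frac{1}{N} \operatorname{tr} C_N \Big) \Big( \frac{1}{n} \operatorname{tr} C_N \Big). $$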

As expected, $\hat{C}_N$ induces a bias in the mean. For fair
comparison with the normalized SCM, which estimates $C_N$
up to a scale constant, let us define the normalized moments

$$ \bar{M}^{\rm R}_{N,p} = \frac{M^{\rm R}_{N,p}}{(M^{\rm R}_{N,1})^p} $$

and define similarly $\bar{M}^{\rm SCM}_{N,p}$ for the SCM,
$\bar{M}^{\rm nSCM}_{N,p}$ for the normalized SCM, and $\bar{M}^{\rm oracle}_{N,p}$ for the oracle
estimator. Under the same setting as in Figure 4 we provide
in the table of Figure 5 the successive normalized moments
and relative error compared to $\bar{M}^{\rm oracle}_{N,p}$. In this case, $\bar{M}^{\rm SCM}_{N,p} =
\bar{M}^{\rm nSCM}_{N,p}$. For the scenario at hand, given the large support of
$F^{\rm R}_N$, even low order moments tend to take large values so
that the asymptotic moment approximation only theoretically
holds for $p$ rather small when $N = 100$ and we thus only
provide these first order moments. The results demonstrate an
important advantage brought by $\hat{C}_N$ versus the SCM in that
the first few order moments are better preserved.

                                        p = 2          p = 3         p = 4
$\bar{M}^{\rm oracle}_{N,p}$            9.28           129           1993
$\bar{M}^{\rm R}_{N,p}$ (error)         9.18 (1.1%)    126 (1.8%)    1945 (2.4%)
$\bar{M}^{\rm SCM}_{N,p}$ (error)       8.53 (8.2%)    112 (13%)     1660 (17%)

Fig. 5. Normalized moments $\bar{M}^{\rm R}_{N,p}$ and $\bar{M}^{\rm SCM}_{N,p}$ versus $\bar{M}^{\rm oracle}_{N,p}$, and relative
error. Random outliers, $N = 100$, $c = .2$, $[C_N]_{ij} =
.9^{|i-j|}$, $D_N = I_N$, $\varepsilon = .05$, $u = u_{\rm H}$, $t = .1$.


VI. Discussion and Concluding Remarks 

Our study of the robust estimator $\hat C_N$ in the large random matrix regime has already led to several interesting conclusions, which we shall more thoroughly address in this section.

Most investigations of robust estimators of scatter focus on the more tractable case where the samples (i.e., the columns of $Y$) are independent with identical elliptical distribution. Recent results have revealed that, as $u(x)$ gets close to the Tyler $1/x$ function, in the large random matrix regime, $\hat C_N$ tends to behave similar to the normalized SCM. This conclusion was quite pessimistic as it suggested no real improvement of $\hat C_N$ over simplistic alternative robust methods. In the concluding remarks of [9, Section 4], the authors anticipated a change of behavior of $\hat C_N$ versus the normalized SCM for deterministic outlier data. This was revealed here both in Section IV and in Section V, where it is made clear that, unlike the normalized SCM, the robust estimators of scatter smartly detect the outliers, essentially by evaluating and comparing the quadratic forms $y^\dagger C_N^{-1} y$ for each column vector $y$ of $Y$. Larger $y^\dagger C_N^{-1} y$ imply more attenuation of $y$ within the observed samples. However, an incidental consequence of this behavior of $\hat C_N$ is that small values of $y^\dagger C_N^{-1} y$ enhance the effect of $y$ even though it might not comply with the legitimate sample distribution, thus increasing the probability of inducing false alarms. This has led us to conclude that the function $u$ should be adequately tuned to avoid such a phenomenon. Another consequence is that matrices $\hat C_N$ with legitimate data of covariance $C_N$ close to the identity will have very poor outlier rejection properties.
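The attenuation and enhancement mechanism described above can be illustrated with a toy computation. This is only a sketch under strong assumptions: the population covariance is taken diagonal and used directly in the quadratic form, and the weight function `u` below is a hypothetical non-increasing function chosen for illustration, not necessarily the one used in the paper.

```python
import math

def quadratic_form(y, cov_eigs):
    """(1/N) y^T C^{-1} y for a diagonal C with eigenvalues cov_eigs."""
    size = len(y)
    return sum(y_k * y_k / lam for y_k, lam in zip(y, cov_eigs)) / size

def u(x, t=0.1):
    """Hypothetical non-increasing weight function, for illustration only."""
    return (1.0 + t) / (t + x)

N = 3
cov_eigs = [4.0, 1.0, 0.25]            # hypothetical diagonal C_N
norm = math.sqrt(N)
along_strong = [norm, 0.0, 0.0]        # outlier aligned with the largest eigenvalue
along_weak = [0.0, 0.0, norm]          # same norm, aligned with the smallest one

w_strong = u(quadratic_form(along_strong, cov_eigs))   # small form: weight > 1
w_weak = u(quadratic_form(along_weak, cov_eigs))       # large form: weight < 1

identity = [1.0, 1.0, 1.0]
w_id_a = u(quadratic_form(along_strong, identity))
w_id_b = u(quadratic_form(along_weak, identity))
```

Both outliers have the same norm, yet the one aligned with the weak direction of $C_N$ is attenuated while the other is enhanced. With an identity covariance the two weights coincide, in line with the poor rejection properties noted above for $C_N$ close to the identity.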

When the outliers are few, the empirical spectral measure $F^{\hat C_N}$ of $\hat C_N$ is asymptotically the same as that of the SCM, normalized SCM, and oracle estimators. As such, if one's interest is in functionals of the eigenvalues of $\hat C_N$, such as moments, and only few outliers are expected, sophisticated robust estimators come to no avail. This being said, the outliers may naturally engender extra isolated eigenvalues (only finitely many) in the spectrum, which $\hat C_N$ might suitably remove while the normalized SCM may not (recall Figure 1). For subspace detection and estimation applications, where the information often lies in the eigenvectors of isolated eigenvalues, discarding such outlying information is critical and thus robust estimators may bring important performance gains. For instance, applications in finance and biostatistics (where data are often assumed to contain outliers) heavily rely on isolated eigenvalue-eigenvector pairs, see e.g., [20], [21]. The experimenter must however keep in mind that, according to our analysis, $\hat C_N$ is most effective at automatically suppressing isolated outliers (the fewer of these relative to the legitimate samples the better) and loses discriminatory power as the outliers approach one another.

The observation made in Section V that the spectral distribution of $\hat C_N$ (in particular through its first order moments) is much closer to that of the oracle estimator than is that of the (normalized or not) SCM leads to some interesting applications when it comes to designing improved estimators for $C_N$ that account both for the fact that $n$ is not large compared to $N$ and for the fact that the observed data are prone to outliers. Such investigations were successively made in [22] for the finite $N, n$ regime and later in [10] for the large $N, n$ regime, where hybrid Ledoit-Wolf [23] and Tyler [8] estimators were proposed that improve the estimation of $C_N$ by providing an extra degree of freedom (a regularization parameter) which is selected so as to minimize the expected Frobenius norm of the difference between $C_N$ and the estimator under study. Since the Frobenius norm is nothing but a functional of second order moments, the observation made in the table of Figure 5 strongly suggests that the Ledoit-Wolf estimator alone (being based on the SCM) would be quite sensitive to deterministic outliers while the estimators studied in [10], [22], which are essentially of a similar class as $\hat C_N$, would be much more resilient to such outliers.

When the number of outliers is much larger, even in the random outlier scenario studied in Section V, very little can be said. However, we noticed an interesting symmetry in the equations defining the weights $\gamma_n$ and $\alpha_n$ of Corollary 3, which reveals that the asymptotic proportion $\epsilon$ of outliers versus $1 - \epsilon$ of legitimate data could tip for $\epsilon > .5$ towards letting the outliers be considered as the truly legitimate data.


In summary, the present study provides a first step towards a better understanding of the behavior of (classical) robust estimators of scatter against arbitrary outliers. Our findings underline several key aspects of such estimators of profound practical relevance, such as the importance of the population covariance matrix $C_N$ of the legitimate data in the rejection power of the estimator, as well as the risks inherent to using weight functions $u$ of the Tyler type. Nonetheless, this study remains at the theoretical level of the estimator itself and does not consider the implications when used as a plug-in estimator in detection or estimation methods. Whether these methods are based on local information (isolated eigenvalue, specific eigenvectors, etc.) or global information (functional of the eigenvalues, projections on large subspaces, etc.) about $C_N$ will entail significant differences in the way $\hat C_N$, through the weight function $u$, must be tailored. Such considerations are left to future investigations.

Appendix A
Proof of Theorem 1

The main technical difficulty of the article lies in the proof of Theorem 1, which extends the methods developed in [9] to multiple sample types. The present section is dedicated to this proof. Some auxiliary random matrix results will then be listed in Appendix B while Appendix C will deal with the (rather immediate) proof of Corollary 2.

The proof of Theorem 1 is divided in two parts. First, we show that the system of fixed-point equations of Theorem 1 admits a unique vector solution and that such solution is bounded as $n \to \infty$. This then defines unequivocally the matrix $\hat S_N$. We then show in a second part that $\|\hat C_N - \hat S_N\| \to 0$ almost surely.


A. Existence, uniqueness, boundedness of the solution to the fixed-point equations

To prove existence and uniqueness, we use the framework 
of standard interference functions [24]. 

Definition 1. A function $h = (h_0, \ldots, h_s) : \mathbb{R}_+^{1+s} \to \mathbb{R}_+^{1+s}$ is a standard interference function if it satisfies the conditions:

1) Positivity: if $q_0, \ldots, q_s \geq 0$, then $h_i(q_0, \ldots, q_s) > 0$ for all $i$.

2) Monotonicity: if $q_0 \geq q_0', \ldots, q_s \geq q_s'$ then, for all $i$, $h_i(q_0, \ldots, q_s) \geq h_i(q_0', \ldots, q_s')$.

3) Scalability: for all $\delta > 1$ and all $i$, $\delta h_i(q_0, \ldots, q_s) > h_i(\delta q_0, \ldots, \delta q_s)$.

By [24, Thm. 2], if $h$ is a standard interference function for which there exists $(q_0, \ldots, q_s)$ such that $q_i \geq h_i(q_0, \ldots, q_s)$ for all $i$, then the system of equations $q_i = h_i(q_0, \ldots, q_s)$, $i = 0, \ldots, s$, has a unique solution.
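To make the framework concrete, the following sketch iterates $q \leftarrow h(q)$ on a toy standard interference function. The map `h_toy` is hypothetical (an affine map with positive offsets and non-negative gains, which satisfies positivity, monotonicity and scalability) and is not the function $h$ of the present proof; in this toy case the iteration is a contraction and reaches the same fixed point from different starting points.

```python
def fixed_point(h, q_init, tol=1e-12, max_iter=10_000):
    """Iterate q <- h(q) until successive iterates differ by less than tol."""
    q = list(q_init)
    for _ in range(max_iter):
        q_next = h(q)
        if max(abs(a - b) for a, b in zip(q, q_next)) < tol:
            return q_next
        q = q_next
    raise RuntimeError("fixed-point iteration did not converge")

def h_toy(q):
    """Toy affine map: positive offsets, non-negative gains (hypothetical)."""
    return [1.0 + 0.5 * q[1], 1.0 + 0.25 * q[0]]

q_star = fixed_point(h_toy, [0.0, 0.0])
q_star_other_start = fixed_point(h_toy, [10.0, 10.0])
```

The feasibility condition of the theorem holds here, for instance at $q = (2, 2)$, for which $h(q) = (2, 1.5)$. Solving the two linear equations by hand gives the fixed point $(12/7, 10/7)$.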

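Define $h = (h_0, \ldots, h_{\epsilon_n n}) : \mathbb{R}_+^{1+\epsilon_n n} \to \mathbb{R}_+^{1+\epsilon_n n}$ with

$$h_0(q_0, \ldots, q_{\epsilon_n n}) = \frac{1}{N}\operatorname{tr} C_N \left( \frac{(1-\epsilon)\, v(q_0)}{1 + c\, v(q_0)\, q_0}\, C_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(q_j)\, a_j a_j^\dagger \right)^{-1}$$

$$h_i(q_0, \ldots, q_{\epsilon_n n}) = \frac{1}{N}\, a_i^\dagger \left( \frac{(1-\epsilon)\, v(q_0)}{1 + c\, v(q_0)\, q_0}\, C_N + \frac{1}{n}\sum_{j \neq i} v(q_j)\, a_j a_j^\dagger \right)^{-1} a_i$$

for $i = 1, \ldots, \epsilon_n n$. Let us prove that $h$ meets the conditions of Definition 1 and that, for $i = 0, \ldots, \epsilon_n n$, $h_i(q_0, \ldots, q_{\epsilon_n n}) \leq q_i$ for some $(q_0, \ldots, q_{\epsilon_n n})$, which will then prove existence and uniqueness.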

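From Assumption 1 and the fact that $v$ is bounded, we clearly have $h_i > 0$ for all $i$. To show monotonicity, let us first define

$$B_N(q_0, \ldots, q_{\epsilon_n n}) = \frac{(1-\epsilon)\, v(q_0)}{1 + c\, v(q_0)\, q_0}\, C_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(q_j)\, a_j a_j^\dagger$$

and take $q_0, \ldots, q_{\epsilon_n n}$ and $q_0', \ldots, q_{\epsilon_n n}'$ such that $q_i \geq q_i'$ for all $i$. Then, since $v$ is non-increasing and $\psi(x) = x v(x)$ is increasing,

$$B_N(q_0, \ldots, q_{\epsilon_n n}) \preceq B_N(q_0', \ldots, q_{\epsilon_n n}').$$

From [·, Cor. 7.7.4], this implies

$$\left(B_N(q_0, \ldots, q_{\epsilon_n n})\right)^{-1} \succeq \left(B_N(q_0', \ldots, q_{\epsilon_n n}')\right)^{-1}$$

from which $h_0(q_0, \ldots, q_{\epsilon_n n}) \geq h_0(q_0', \ldots, q_{\epsilon_n n}')$. By the same arguments, $h_i(q_0, \ldots, q_{\epsilon_n n}) \geq h_i(q_0', \ldots, q_{\epsilon_n n}')$ for $i = 1, \ldots, \epsilon_n n$, thus proving the monotonicity of $h$. Finally, to show scalability, let us rewrite $h_0$ as

$$h_0(q_0, \ldots, q_{\epsilon_n n}) = \frac{1}{N}\operatorname{tr} C_N \left( (1-\epsilon)\,\frac{\theta(q_0)}{q_0}\, C_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} \frac{\psi(q_j)}{q_j}\, a_j a_j^\dagger \right)^{-1}$$

where $\theta(x) = \frac{\psi(x)}{1 + c\,\psi(x)}$. Since $\psi(x)$ is increasing, so is $\theta(x)$ and, for any $\delta > 1$,

$$h_0(\delta q_0, \ldots, \delta q_{\epsilon_n n}) = \delta\, \frac{1}{N}\operatorname{tr} C_N \left( (1-\epsilon)\,\frac{\theta(\delta q_0)}{q_0}\, C_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} \frac{\psi(\delta q_j)}{q_j}\, a_j a_j^\dagger \right)^{-1} < \delta\, h_0(q_0, \ldots, q_{\epsilon_n n}).$$

We show similarly $h_i(\delta q_0, \ldots, \delta q_{\epsilon_n n}) < \delta\, h_i(q_0, \ldots, q_{\epsilon_n n})$ for $i = 1, \ldots, \epsilon_n n$, thus proving the scalability of $h$.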

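Thus, $h$ is a standard interference function and it remains to show that $h_i(q_0, \ldots, q_{\epsilon_n n}) \leq q_i$ for some $(q_0, \ldots, q_{\epsilon_n n})$ and for all $i$. For $i = 0$,

$$h_0(q_0, \ldots, q_{\epsilon_n n}) = \frac{1}{N}\operatorname{tr} C_N \left(B_N(q_0, \ldots, q_{\epsilon_n n})\right)^{-1}$$

where

$$B_N(q_0, \ldots, q_{\epsilon_n n}) \succeq \frac{(1-\epsilon)\, v(q_0)}{1 + c\, v(q_0)\, q_0}\, C_N$$

and thus, by definition of $\psi$,

$$h_0(q_0, \ldots, q_{\epsilon_n n}) \leq \frac{1 + c\,\psi(q_0)}{(1-\epsilon)\,\psi(q_0)}\, q_0. \qquad (6)$$

As a consequence, we need to find some $q_0$ for which $\frac{1 + c\,\psi(q_0)}{(1-\epsilon)\,\psi(q_0)} \leq 1$ or, equivalently, $\psi(q_0) \geq \frac{1}{1-\epsilon-c}$. Such a choice of $q_0$ is always possible since $\psi$ is increasing on $[0, \infty)$ with image $[0, \psi_\infty)$ where $\frac{1}{1-\epsilon-c} < \psi_\infty$ (this unfolds from $\phi_\infty > \frac{1}{1-\epsilon}$). Therefore, for any $q_0$ such that $\frac{1}{1-\epsilon-c} \leq \psi(q_0) < \psi_\infty$, we have $h_0(q_0, \ldots, q_{\epsilon_n n}) \leq q_0$. Take for instance $q_0 = \psi^{-1}\left(\frac{1}{1-\epsilon-c}\right)$ and consider now the functions $h_i$, $i = 1, \ldots, \epsilon_n n$, for which, using [23, Lemma 10] and similar arguments as above,

$$h_i(q_0, \ldots, q_{\epsilon_n n}) \leq q_0\, \frac{1 + c\,\psi(q_0)}{(1-\epsilon)\,\psi(q_0)}\, \frac{1}{N}\, a_i^\dagger C_N^{-1} a_i = q_0\, \frac{1}{N}\, a_i^\dagger C_N^{-1} a_i \triangleq w_i. \qquad (7)$$

Therefore, taking $q_i = w_i$ for $i = 1, \ldots, \epsilon_n n$, we also have $h_i(q_0, \ldots, q_{\epsilon_n n}) \leq q_i$. Altogether, we have shown that the function $h$ satisfies the conditions of [24, Thm. 2], implying that there exists a unique solution to the system of fixed-point equations. As such, $\hat S_N$ as introduced in the statement of Theorem 1 is well-defined.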
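The choice $q_0 = \psi^{-1}\left(\frac{1}{1-\epsilon-c}\right)$ can be checked numerically on the right-hand side of (6). The sketch below assumes a hypothetical weight $v(x) = (1+t)/(t+x)$, for which $\psi(x) = x v(x)$ is increasing with $\psi_\infty = 1 + t$; this specific form is chosen only because $\psi^{-1}$ is then available in closed form, and is not claimed to be the $v$ of the paper.

```python
def v(x, t=1.0):
    """Hypothetical weight v(x) = (1 + t) / (t + x)."""
    return (1.0 + t) / (t + x)

def psi(x, t=1.0):
    """psi(x) = x v(x), increasing on [0, inf) with supremum 1 + t."""
    return x * v(x, t)

def psi_inverse(y, t=1.0):
    """Inverse of psi on [0, 1 + t): solves (1 + t) x / (t + x) = y."""
    return t * y / (1.0 + t - y)

def h0_upper_bound(q0, eps, c, t=1.0):
    """Right-hand side of (6): (1 + c psi(q0)) / ((1 - eps) psi(q0)) * q0."""
    return (1.0 + c * psi(q0, t)) / ((1.0 - eps) * psi(q0, t)) * q0

eps, c = 0.05, 0.2                      # requires 1 / (1 - eps - c) < 1 + t
q0 = psi_inverse(1.0 / (1.0 - eps - c))
bound_at_q0 = h0_upper_bound(q0, eps, c)
```

With these values the bound equals $q_0$ at the chosen point, falls below $q$ for larger $q$, and exceeds $q$ for smaller $q$, which is the feasibility threshold described in the text.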


We now turn our focus to the boundedness of the solution to the fixed-point equations. From (6) and (7), along with Assumption 1, we immediately have that $(\gamma_n, \alpha_{1,n}, \ldots, \alpha_{\epsilon_n n, n})$ is uniformly bounded in $n$, i.e., $\limsup_n \gamma_n < \infty$ and $\limsup_n \max_{1 \leq i \leq \epsilon_n n} \alpha_{i,n} < \infty$. Furthermore, $\gamma_n$ can be shown to be uniformly away from zero as follows. By monotonicity of the $h$ function, $h_0(q_0, \ldots, q_{\epsilon_n n}) \geq h_0(0, \ldots, 0)$, i.e.,

$$h_0(q_0, \ldots, q_{\epsilon_n n}) \geq \frac{1}{v(0)}\,\frac{1}{N}\operatorname{tr} H_N^{-1} \geq \frac{1}{v(0)\,\|H_N\|}$$

where the matrix $H_N$ is defined as

$$H_N = (1-\epsilon)\, I_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} C_N^{-1/2} a_j a_j^\dagger C_N^{-1/2}.$$

By Assumption 1, we have $\limsup_n \|H_N\| < \infty$ and, consequently, $\liminf_n \gamma_n > 0$.


B. Convergence of $\hat C_N - \hat S_N$

Having proved that $\hat S_N$ is well defined, we now turn to the core of the proof of Theorem 1. The outline of the proof follows tightly that of [9, Thm. 2] but for a model that is (i) simpler in its assuming the legitimate data to be essentially Gaussian instead of elliptical, but (ii) made more complex due to the deterministic addition of the vectors $a_1, \ldots, a_{\epsilon_n n}$. Our way to deal with (ii) is by controlling in parallel the quantities asymptotically approximated by $\gamma_n$ and those asymptotically approximated by $\alpha_{i,n}$. Since some parts of the proof mirror closely those in [9, Thm. 2], we shall mainly focus on the significantly differing aspects.

First note that we can assume $C_N = I_N$ by studying $C_N^{-1/2} \hat C_N C_N^{-1/2}$ instead of $\hat C_N$, in which case we have $C_N^{-1/2} a_i$ in place of the original $a_i$. This can be seen from the implicit equation solved by $\hat C_N$. Hence, from now on we assume $C_N = I_N$ without loss of generality. Using the definition $v(x) = u(g_n^{-1}(x))$, with $g_n(x) = x/(1 - c_n \phi(x))$, and following the same steps as in [9], let us write

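$$\hat C_N = \frac{1}{n}\sum_{i=1}^{(1-\epsilon_n)n} v(d_i)\, x_i x_i^\dagger + \frac{1}{n}\sum_{i=1}^{\epsilon_n n} v(b_i)\, a_i a_i^\dagger$$

with $d_i = \frac{1}{N} x_i^\dagger \hat C_{(x_i)}^{-1} x_i$ and $b_i = \frac{1}{N} a_i^\dagger \hat C_{(a_i)}^{-1} a_i$, where $\hat C_{(x_i)} = \hat C_N - \frac{1}{n} v(d_i)\, x_i x_i^\dagger$ and $\hat C_{(a_i)} = \hat C_N - \frac{1}{n} v(b_i)\, a_i a_i^\dagger$. Further define

$$e_i \triangleq \frac{v(d_i)}{v(\gamma_n)}, \qquad f_i \triangleq \frac{v(b_i)}{v(\alpha_{i,n})}$$

with $\gamma_n$ and $\alpha_{i,n}$ as in the statement of Theorem 1 but for $C_N = I_N$, i.e., $\gamma_n$ and $\alpha_{i,n}$ are the positive solutions to

$$\gamma_n = \frac{1}{N}\operatorname{tr}\left( \frac{(1-\epsilon)\, v(\gamma_n)}{1 + c\, v(\gamma_n)\, \gamma_n}\, I_N + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(\alpha_{j,n})\, a_j a_j^\dagger \right)^{-1}$$

$$\alpha_{i,n} = \frac{1}{N}\, a_i^\dagger \left( \frac{(1-\epsilon)\, v(\gamma_n)}{1 + c\, v(\gamma_n)\, \gamma_n}\, I_N + \frac{1}{n}\sum_{j \neq i} v(\alpha_{j,n})\, a_j a_j^\dagger \right)^{-1} a_i.$$

The core of the proof is to show that

$$\max_{1 \leq i \leq (1-\epsilon_n)n} |e_i - 1| \to 0 \quad \text{a.s.} \qquad (8)$$

$$\max_{1 \leq i \leq \epsilon_n n} |f_i - 1| \to 0 \quad \text{a.s.} \qquad (9)$$

Let us first relabel $e_i$ and $f_i$ such that $e_1 \leq \ldots \leq e_{(1-\epsilon_n)n}$ and $f_1 \leq \ldots \leq f_{\epsilon_n n}$ and denote $\delta_n = \max(e_{(1-\epsilon_n)n}, f_{\epsilon_n n})$. For any $i = 1, \ldots, (1-\epsilon_n)n$, we have

$$e_i = \frac{1}{v(\gamma_n)}\, v\left( \frac{1}{N}\, x_i^\dagger \left( \frac{1}{n}\sum_{j \neq i} v(d_j)\, x_j x_j^\dagger + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(b_j)\, a_j a_j^\dagger \right)^{-1} x_i \right)$$

$$\leq \frac{1}{v(\gamma_n)}\, v\left( \frac{1}{\delta_n N}\, x_i^\dagger \left( \frac{1}{n}\sum_{j \neq i} v(\gamma_n)\, x_j x_j^\dagger + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(\alpha_{j,n})\, a_j a_j^\dagger \right)^{-1} x_i \right)$$

where we used $v(d_j) = v(\gamma_n)\, e_j$, $v(b_j) = v(\alpha_{j,n})\, f_j$, and the inequality arises from $e_j, f_j \leq \delta_n$, from $v$ being non-increasing, and from [·, Cor. 7.7.4]. For readability, let

$$F_{N,(i)} \triangleq \frac{1}{n}\sum_{j \neq i} v(\gamma_n)\, x_j x_j^\dagger + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(\alpha_{j,n})\, a_j a_j^\dagger.$$

From the random matrix result, Lemma 1 of Appendix B,

$$\max_{1 \leq i \leq (1-\epsilon_n)n} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \gamma_n \right| \to 0 \quad \text{a.s.}$$

Thus, for $\zeta > 0$, with probability one, we have for all large $n$

$$e_{(1-\epsilon_n)n} \leq \frac{v\left( \frac{1}{\delta_n}(\gamma_n - \zeta) \right)}{v(\gamma_n)}. \qquad (10)$$

We can proceed similarly to bound $f_i$ from above as

$$f_i \leq \frac{v\left( \frac{1}{\delta_n N}\, a_i^\dagger G_{N,(i)}^{-1} a_i \right)}{v(\alpha_{i,n})}$$

for any $i = 1, \ldots, \epsilon_n n$, with

$$G_{N,(i)} \triangleq \frac{1}{n}\sum_{j=1}^{(1-\epsilon_n)n} v(\gamma_n)\, x_j x_j^\dagger + \frac{1}{n}\sum_{j \neq i} v(\alpha_{j,n})\, a_j a_j^\dagger$$

and we now use Lemma 2 in Appendix B, which states

$$\max_{1 \leq i \leq \epsilon_n n} \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right| \to 0 \quad \text{a.s.}$$

Therefore, for the same $\zeta > 0$ and for all large $n$ a.s.,

$$f_{\epsilon_n n} \leq \frac{v\left( \frac{1}{\delta_n}(\alpha_{\epsilon_n n, n} - \zeta) \right)}{v(\alpha_{\epsilon_n n, n})}. \qquad (11)$$

We now consider separately the subsequence of $n$ over which $e_{(1-\epsilon_n)n} \geq f_{\epsilon_n n}$ and that over which $e_{(1-\epsilon_n)n} < f_{\epsilon_n n}$ (these subsequences may be empty or finite).

Subsequence $e_{(1-\epsilon_n)n} \geq f_{\epsilon_n n}$: On this subsequence, $\delta_n = e_{(1-\epsilon_n)n}$ and (10) becomes

$$e_{(1-\epsilon_n)n} \leq \frac{v\left( \frac{1}{e_{(1-\epsilon_n)n}}(\gamma_n - \zeta) \right)}{v(\gamma_n)}$$

or alternatively, since $e_{(1-\epsilon_n)n}$ is positive (multiply both sides by $(\gamma_n - \zeta)/e_{(1-\epsilon_n)n}$ and use $\psi(x) = x v(x)$),

$$1 \leq \frac{\psi\left( \frac{1}{e_{(1-\epsilon_n)n}}(\gamma_n - \zeta) \right)}{\psi(\gamma_n)\left(1 - \frac{\zeta}{\gamma_n}\right)}.$$

We want to prove that, for any $\ell > 0$, $e_{(1-\epsilon_n)n} < 1 + \ell$ for all large $n$ a.s. Let us assume the opposite, i.e., $e_{(1-\epsilon_n)n} > 1 + \ell$ infinitely often, and let us restrict ourselves to a (further) subsequence where this always holds. Then, since $\psi$ is increasing,

$$1 \leq \frac{\psi\left( \frac{1}{e_{(1-\epsilon_n)n}}(\gamma_n - \zeta) \right)}{\psi(\gamma_n)\left(1 - \frac{\zeta}{\gamma_n}\right)} \leq \frac{\psi\left( \frac{\gamma_n}{1+\ell}\left(1 - \frac{\zeta}{\gamma_n}\right) \right)}{\psi(\gamma_n)\left(1 - \frac{\zeta}{\gamma_n}\right)}.$$

From the uniform boundedness of $\gamma_n$ away from zero and infinity (see Appendix A-A), considering yet a further subsequence over which $\gamma_n \to \gamma_0 > 0$, we obtain in the limit

$$1 - \frac{\zeta}{\gamma_0} \leq \frac{\psi\left( \frac{\gamma_0}{1+\ell}\left(1 - \frac{\zeta}{\gamma_0}\right) \right)}{\psi(\gamma_0)}.$$

This being valid for each $\zeta > 0$, a contradiction is raised in the limit $\zeta \to 0$, since $\psi$ is increasing and therefore $\psi(\gamma_0/(1+\ell)) < \psi(\gamma_0)$. Therefore, either the subsequence over which $e_{(1-\epsilon_n)n} \geq f_{\epsilon_n n}$ is finite or $e_{(1-\epsilon_n)n} < 1 + \ell$ for all large $n$ a.s. Assuming the former, then $e_{(1-\epsilon_n)n} < f_{\epsilon_n n}$ for all large $n$, which is considered next.

Subsequence $e_{(1-\epsilon_n)n} < f_{\epsilon_n n}$: On this subsequence, $\delta_n = f_{\epsilon_n n}$ and (11) becomes

$$f_{\epsilon_n n} \leq \frac{v\left( \frac{1}{f_{\epsilon_n n}}(\alpha_{\epsilon_n n, n} - \zeta) \right)}{v(\alpha_{\epsilon_n n, n})} \qquad (12)$$

for all large $n$ a.s. Again, we wish to prove that with, say, the same $\ell > 0$ as above, $f_{\epsilon_n n} < 1 + \ell$ for all large $n$ a.s. Consider first the case $\liminf_n \alpha_{\epsilon_n n, n} = 0$ and restrict ourselves to those converging subsequences over which $\alpha_{\epsilon_n n, n} \to 0$. In the limit, $v(\alpha_{\epsilon_n n, n}) \to v(0)$ so that, for any $\theta > 0$ and for $n$ large enough, $v(\alpha_{\epsilon_n n, n}) > v(0) - \theta$. This, along with $v\left( \frac{1}{f_{\epsilon_n n}}(\alpha_{\epsilon_n n, n} - \zeta) \right) \leq v(0)$, gives $f_{\epsilon_n n} \leq v(0)/(v(0) - \theta)$ for all large $n$, implying that, for any $\ell > 0$, $f_{\epsilon_n n} < 1 + \ell$ for all large $n$ a.s. Consider now the rest of subsequences for which $\liminf_n \alpha_{\epsilon_n n, n} > 0$ and rewrite (12) as

$$1 \leq \frac{\psi\left( \frac{1}{f_{\epsilon_n n}}(\alpha_{\epsilon_n n, n} - \zeta) \right)}{\psi(\alpha_{\epsilon_n n, n})\left(1 - \frac{\zeta}{\alpha_{\epsilon_n n, n}}\right)}.$$

As above for $e_{(1-\epsilon_n)n}$, we assume $f_{\epsilon_n n} > 1 + \ell$ infinitely often, and restrict ourselves to a further subsequence where this holds for all $n$. Then,

$$1 \leq \frac{\psi\left( \frac{\alpha_{\epsilon_n n, n}}{1+\ell}\left(1 - \frac{\zeta}{\alpha_{\epsilon_n n, n}}\right) \right)}{\psi(\alpha_{\epsilon_n n, n})\left(1 - \frac{\zeta}{\alpha_{\epsilon_n n, n}}\right)}.$$

From the boundedness of $\alpha_{\epsilon_n n, n}$ (see Appendix A-A), we can take a converging (further) subsequence over which $\alpha_{\epsilon_n n, n} \to \alpha_0 > 0$. In the limit,

$$1 - \frac{\zeta}{\alpha_0} \leq \frac{\psi\left( \frac{\alpha_0}{1+\ell}\left(1 - \frac{\zeta}{\alpha_0}\right) \right)}{\psi(\alpha_0)}$$

which is contradictory for sufficiently small $\zeta$. Thus, necessarily $f_{\epsilon_n n} < 1 + \ell$ for all large $n$ a.s., unless we have $e_{(1-\epsilon_n)n} \geq f_{\epsilon_n n}$ in which case, as shown above, $f_{\epsilon_n n} \leq e_{(1-\epsilon_n)n} < 1 + \ell$ for all large $n$ a.s.

Altogether, we necessarily have

$$\max\{e_{(1-\epsilon_n)n}, f_{\epsilon_n n}\} < 1 + \ell$$

for all large $n$ a.s. All the same, by reverting the inequalities, we prove that, for all large $n$ a.s.,

$$\min\{e_1, f_1\} > 1 - \ell$$

and therefore, altogether,

$$\max_{1 \leq i \leq (1-\epsilon_n)n} |e_i - 1| \leq \ell, \qquad \max_{1 \leq i \leq \epsilon_n n} |f_i - 1| \leq \ell$$

for all large $n$ a.s., which eventually proves (8) and (9) by taking a countable sequence of $\ell$ going to zero. This establishes the main result, from which Theorem 1 unfolds. Specifically, from (8)-(9) and by uniform boundedness of $\gamma_n$ and $\alpha_{i,n}$,

$$\max_{1 \leq i \leq (1-\epsilon_n)n} |v(d_i) - v(\gamma_n)| \to 0 \quad \text{a.s.}$$

$$\max_{1 \leq i \leq \epsilon_n n} |v(b_i) - v(\alpha_{i,n})| \to 0 \quad \text{a.s.}$$

Thus, for any $\ell > 0$ and for all large $n$ a.s.,

$$(1-\ell)\, \hat S_N \preceq \hat C_N \preceq (1+\ell)\, \hat S_N$$

and, therefore, $\|\hat C_N - \hat S_N\| \leq 2\ell\, \|\hat S_N\|$. Using the triangle inequality and the fact that $v$ is non-increasing, we have

$$\|\hat C_N - \hat S_N\| \leq 2\ell\, v(0) \left( \left\| \frac{1}{n}\sum_{i=1}^{(1-\epsilon_n)n} x_i x_i^\dagger \right\| + \left\| \frac{1}{n}\sum_{i=1}^{\epsilon_n n} a_i a_i^\dagger \right\| \right).$$

From [26] and Assumption 2, $\left\| \frac{1}{n}\sum_{i=1}^{(1-\epsilon_n)n} x_i x_i^\dagger \right\| < 4(1-\epsilon)$ for all large $n$ a.s. and, from Assumption 1, $\limsup_n \left\| \frac{1}{n}\sum_{i=1}^{\epsilon_n n} a_i a_i^\dagger \right\| < \infty$. Then, since $\ell$ is arbitrarily small, $\|\hat C_N - \hat S_N\|$ tends to zero a.s. as $n \to \infty$, which concludes the proof of Theorem 1. For $C_N \neq I_N$ we simply need to show $\|C_N^{1/2}(\hat C_N - \hat S_N) C_N^{1/2}\| \to 0$ a.s., which follows from $\|C_N^{1/2}(\hat C_N - \hat S_N) C_N^{1/2}\| \leq \|C_N\|\, \|\hat C_N - \hat S_N\|$ since, by assumption, $\limsup_N \|C_N\| < \infty$.

For the random outliers scenario, Assumption 1 holds a.s. by virtue of [26], provided that $\limsup_N \|D_N C_N^{-1}\| < \infty$. Then, the proof of Corollary 2 follows from applying standard random matrix arguments to the model of $\hat S_N$ in Theorem 1, considered now as a random matrix in both $y_i$ and $a_i$. The result may be straightforwardly obtained from, e.g., [25, Thm. 1] (see Appendix B for similar applications).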


Appendix B 

Random Matrix Results 


In this section we list several intermediary results needed 
in Appendix [A] 

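Lemma 1. Let Assumptions 1-2 hold. Define

$$F_N = \frac{1}{n}\sum_{i=1}^{(1-\epsilon_n)n} v(\gamma_n)\, x_i x_i^\dagger + \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(\alpha_{j,n})\, a_j a_j^\dagger$$

and $F_{N,(i)} = F_N - \frac{1}{n} v(\gamma_n)\, x_i x_i^\dagger$, with $\gamma_n$ and $\alpha_{j,n}$ given in Theorem 1. Then, as $n \to \infty$,

$$\max_{1 \leq i \leq (1-\epsilon_n)n} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \gamma_n \right| \to 0 \quad \text{a.s.}$$

Proof: We first need to establish a result on $\lambda_1(F_{N,(i)})$, for which we know that $\lambda_1(F_{N,(i)}) \geq \lambda_1\left( v(\gamma_n)\, \frac{1}{n}\sum_{j \neq i} x_j x_j^\dagger \right)$. Then, [18, Lemma 1] along with Assumption 2 and the boundedness of $\gamma_n$ show that there exists $\xi > 0$ such that, for all large $n$ a.s.,

$$\min_{1 \leq i \leq (1-\epsilon_n)n} \lambda_1(F_{N,(i)}) > \xi. \qquad (13)$$

With this acquired, the outline of the proof is divided into two main steps. We first prove that $\max_{1 \leq i \leq (1-\epsilon_n)n} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right| \to 0$ a.s. using quadratic-form-close-to-the-trace and rank-one perturbation arguments. Then, using [25, Thm. 1], we show that $\left| \frac{1}{N}\operatorname{tr} F_N^{-1} - \gamma_n \right| \to 0$ a.s.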


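The triangle inequality allows us to write

$$\left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right| \leq \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} \right| + \left| \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} - \frac{1}{N}\operatorname{tr} F_N^{-1} \right|. \qquad (14)$$

Let us bound the two terms on the right hand side of (14). Denote by $E_{x_i}$ the expectation with respect to $x_i$ (i.e., conditionally on $F_{N,(i)}$) and $\kappa_i = 1_{\{\lambda_1(F_{N,(i)}) > \xi\}}$ with $\xi$ defined in (13). For the first term, we can apply [27, Lemma B.26] (since $x_i$ is independent of $F_{N,(i)}$) so that, for $p \geq 2$,

$$E_{x_i}\left[ \kappa_i \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} \right|^p \right] \leq \frac{\kappa_i K_p}{N^{p/2}} \left[ \left( \frac{\nu_4}{N}\operatorname{tr} F_{N,(i)}^{-2} \right)^{p/2} + \frac{\nu_{2p}}{N^{p/2}}\operatorname{tr} F_{N,(i)}^{-p} \right] \qquad (15)$$

for some constant $K_p$ depending only on $p$, with $\nu_\ell$ any value such that $E[|x_{ij}|^\ell] \leq \nu_\ell$. Using $\frac{1}{N^k}\operatorname{tr} B^k \leq \left( \frac{1}{N}\operatorname{tr} B \right)^k$ for $B \in \mathbb{C}^{N \times N}$ nonnegative definite and $k \geq 1$ leads to

$$E_{x_i}\left[ \kappa_i \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} \right|^p \right] \leq \frac{\kappa_i K_p}{N^{p/2}} \left( \nu_4^{p/2} + \nu_{2p} \right) \left( \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-2} \right)^{p/2} \leq \frac{K_p \left( \nu_4^{p/2} + \nu_{2p} \right)}{\xi^p N^{p/2}}, \qquad (16)$$

where for the second inequality we have used $\frac{1}{N}\operatorname{tr} B \leq \|B\|$ for $B \in \mathbb{C}^{N \times N}$ nonnegative definite and the fact that $\kappa_i \|F_{N,(i)}^{-1}\| \leq \xi^{-1}$, which holds from the definition of $\kappa_i$. The bound (16) being irrespective of $F_{N,(i)}$, we can now take the expectation over $F_{N,(i)}$ to obtain

$$E\left[ \kappa_i \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} \right|^p \right] = O\left( \frac{1}{N^{p/2}} \right). \qquad (17)$$

For the second term in (14), we can write $F_{N,(i)} = \left( F_{N,(i)} - \frac{\xi}{2} I_N \right) + \frac{\xi}{2} I_N$, with $F_{N,(i)} - \frac{\xi}{2} I_N \succ 0$ whenever $\kappa_i = 1$, and we have from [19, Lemma 2.6] (rank-one perturbation lemma)

$$E\left[ \kappa_i \left| \frac{1}{N}\operatorname{tr} F_{N,(i)}^{-1} - \frac{1}{N}\operatorname{tr} F_N^{-1} \right|^p \right] \leq \frac{1}{N^p}\left( \frac{2}{\xi} \right)^p. \qquad (18)$$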
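The rank-one perturbation effect behind (18) can be illustrated on a small example. The sketch below is not the lemma of [19] itself: it assumes a diagonal positive definite matrix $B$ (stored as its diagonal) and uses the Sherman-Morrison formula to compare $\operatorname{tr} B^{-1}$ with $\operatorname{tr}(B + vv^T)^{-1}$; the gap is non-negative and never exceeds $1/\lambda_{\min}(B)$, however large $v$ is. All numerical values are hypothetical.

```python
def trace_inverse_diag(b):
    """tr B^{-1} for a diagonal matrix B given by its diagonal entries."""
    return sum(1.0 / lam for lam in b)

def trace_inverse_rank_one_update(b, v):
    """tr (B + v v^T)^{-1} for diagonal B, via the Sherman-Morrison formula."""
    quad1 = sum(x * x / lam for x, lam in zip(v, b))          # v^T B^{-1} v
    quad2 = sum(x * x / (lam * lam) for x, lam in zip(v, b))  # v^T B^{-2} v
    return trace_inverse_diag(b) - quad2 / (1.0 + quad1)

b = [0.5, 1.0, 2.0, 5.0]        # hypothetical eigenvalues, smallest is 0.5
v = [3.0, -1.0, 0.5, 2.0]       # hypothetical perturbation direction
gap = trace_inverse_diag(b) - trace_inverse_rank_one_update(b, v)
```

Because the gap stays bounded independently of $N$, dividing the traces by $N$ makes the difference of order $1/N$, which is the scaling that appears in (18).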


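From (14), we can now use Hölder's inequality and the bounds (17)-(18) to obtain

$$E\left[ \kappa_i \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right|^p \right] = O\left( \frac{1}{N^{p/2}} \right). \qquad (19)$$

Then, we have that

$$\Pr\left( \max_{1 \leq i \leq (1-\epsilon_n)n} \kappa_i^{1/p} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right| > \zeta \right) \leq \sum_{i=1}^{(1-\epsilon_n)n} \Pr\left( \kappa_i^{1/p} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right| > \zeta \right)$$

$$\leq \frac{(1-\epsilon_n)\, n}{\zeta^p}\, E\left[ \kappa_i \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right|^p \right] = O\left( \frac{1}{N^{p/2-1}} \right)$$

where we have used (in order) Boole's inequality, Markov's inequality, and (19). Recall from (15) that the entries of $x_i$ are required to have finite $2p$-th order moment and that, by our initial assumption, $E[|x_{ij}|^{8+\eta}] < \infty$ for some $\eta > 0$. Then, taking $p > 4$, the Borel-Cantelli lemma along with the fact that $\min_{1 \leq i \leq (1-\epsilon_n)n} \kappa_i \to 1$ a.s. ensure

$$\max_{1 \leq i \leq (1-\epsilon_n)n} \left| \frac{1}{N}\, x_i^\dagger F_{N,(i)}^{-1} x_i - \frac{1}{N}\operatorname{tr} F_N^{-1} \right| \to 0 \quad \text{a.s.} \qquad (20)$$

It remains to show that $\gamma_n$ is a deterministic equivalent for $\frac{1}{N}\operatorname{tr} F_N^{-1}$. From (13) and the fact that any subtraction of a nonnegative definite matrix cannot increase the smallest eigenvalue, we have that $\lambda_1(F_N) > \xi$ for all large $n$ a.s. Then, we can write $F_N = \left( F_N - \frac{\xi}{2} I_N \right) + \frac{\xi}{2} I_N$ with $\liminf_n \lambda_1\left( F_N - \frac{\xi}{2} I_N \right) > 0$ a.s. and we are in position to apply [25, Thm. 1], which ensures

$$\frac{1}{N}\operatorname{tr} F_N^{-1} - \frac{1}{N}\operatorname{tr}\left( \frac{(1-\epsilon)\, v(\gamma_n)}{1 + e_N}\, I_N + A_N \right)^{-1} \to 0 \quad \text{a.s.}$$

where $A_N = \frac{1}{n}\sum_{j=1}^{\epsilon_n n} v(\alpha_{j,n})\, a_j a_j^\dagger$ and $e_N$ is the unique positive solution to

$$e_N = c_n\, v(\gamma_n)\, \frac{1}{N}\operatorname{tr}\left( \frac{(1-\epsilon)\, v(\gamma_n)}{1 + e_N}\, I_N + A_N \right)^{-1}.$$

According to the definition of $\gamma_n$, $e_N = c_n v(\gamma_n) \gamma_n$ with $\gamma_n$ the solution to

$$\gamma_n = \frac{1}{N}\operatorname{tr}\left( \frac{(1-\epsilon)\, v(\gamma_n)}{1 + c_n v(\gamma_n) \gamma_n}\, I_N + A_N \right)^{-1}$$

which has been proven to be unique. Altogether,

$$\frac{1}{N}\operatorname{tr} F_N^{-1} - \gamma_n \to 0 \quad \text{a.s.} \qquad (21)$$

Combining (20) and (21) concludes the proof. ■

Lemma 2. Let Assumptions 1-2 hold and define

$$G_{N,(i)} \triangleq \frac{1}{n}\sum_{j=1}^{(1-\epsilon_n)n} v(\gamma_n)\, x_j x_j^\dagger + \frac{1}{n}\sum_{j \neq i} v(\alpha_{j,n})\, a_j a_j^\dagger$$

with $\gamma_n$ and $\alpha_{j,n}$ defined as in Theorem 1. Then, as $n \to \infty$,

$$\max_{1 \leq i \leq \epsilon_n n} \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right| \to 0 \quad \text{a.s.}$$

Proof: Since $\lambda_1(G_{N,(i)}) \geq \lambda_1\left( v(\gamma_n)\, \frac{1}{n}\sum_{j=1}^{(1-\epsilon_n)n} x_j x_j^\dagger \right)$, we can use [18, Lemma 1] along with Assumption 2 and the uniform boundedness of $\gamma_n$ to show that there exists $\xi > 0$ such that, for all large $n$ a.s.,

$$\min_{1 \leq i \leq \epsilon_n n} \lambda_1(G_{N,(i)}) > \xi.$$

Denote $\kappa_i = 1_{\{\lambda_1(G_{N,(i)}) > \xi\}}$. Using similar derivations as for [·, Lemma 3] adapted to the present model, we have

$$E\left[ \kappa_i \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right|^p \right] = O\left( \frac{1}{N^{p/2}} \right). \qquad (22)$$

Then

$$\Pr\left( \max_{1 \leq i \leq \epsilon_n n} \kappa_i^{1/p} \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right| > \zeta \right) \leq \sum_{i=1}^{\epsilon_n n} \Pr\left( \kappa_i^{1/p} \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right| > \zeta \right)$$

$$\leq \frac{1}{\zeta^p}\sum_{i=1}^{\epsilon_n n} E\left[ \kappa_i \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right|^p \right] = O\left( \frac{1}{N^{p/2-1}} \right)$$

where we used (in order) Boole's inequality, Markov's inequality, and (22). Taking $p > 4$, the Borel-Cantelli lemma ensures

$$\max_{1 \leq i \leq \epsilon_n n} \kappa_i^{1/p} \left| \frac{1}{N}\, a_i^\dagger G_{N,(i)}^{-1} a_i - \alpha_{i,n} \right| \to 0 \quad \text{a.s.}$$

which then proves Lemma 2 using $\min_{1 \leq i \leq \epsilon_n n} \kappa_i \to 1$ a.s. ■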
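The role of the condition $p > 4$ in the proofs of Lemmas 1 and 2 can be spelled out with a short calculation. This is only a sketch: it uses the standing assumption that $N/n$ stays bounded away from zero and infinity, and the constant $K$ below is a placeholder rather than a quantity from the text.

```latex
\sum_{n \ge 1} \Pr\Big( \max_i \kappa_i^{1/p}\, |\cdot| > \zeta \Big)
\;\le\; \sum_{n \ge 1} \frac{K}{\zeta^{p}\, N^{p/2-1}} \;<\; \infty
\quad \Longleftrightarrow \quad \frac{p}{2} - 1 > 1
\quad \Longleftrightarrow \quad p > 4 .
```

The union bound over the samples costs one factor of $n$, which is why the exponent drops from $p/2$ to $p/2 - 1$; summability of the resulting series is exactly what the Borel-Cantelli lemma requires to conclude almost sure convergence.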


Appendix C 
Asymptotic moments 

In this last appendix, we derive the moments of the deterministic equivalents studied in [25]. We provide in full the generic result, which may be used for independent purposes. We first recall [25, Thm. 1].


Theorem 2 (Wagner et al., [25]). Let $Y \in \mathbb{C}^{N \times n}$ have independent columns $y_i = H_i x_i$, where $x_i \in \mathbb{C}^{N_i}$ has i.i.d. entries of zero mean, variance $1/n$, and $4 + \eta$ moment of order $O(1/n^{2+\eta/2})$, and $H_i \in \mathbb{C}^{N \times N_i}$ such that $R_i = H_i H_i^\dagger$ has uniformly bounded spectral norm over $n, N$. Let also $A_N \in \mathbb{C}^{N \times N}$ be Hermitian non-negative and denote $\mathbf{F}_N = Y Y^\dagger + A_N$. Then, as $N, N_1, \ldots, N_n$, and $n$ grow large with ratios $c_i = N_i/n$, and $c_0 = N/n$ satisfying $0 < \liminf_n c_i \leq \limsup_n c_i < \infty$ for $0 \leq i \leq n$, we have

$$\frac{1}{n}\operatorname{tr}\left( \mathbf{F}_N - z I_N \right)^{-1} - m_N(z) \to 0 \quad \text{a.s.}$$

with

$$m_N(z) = \frac{1}{n}\operatorname{tr}\left( \frac{1}{n}\sum_{j=1}^{n} \frac{R_j}{1 + e_{N,j}(z)} + A_N - z I_N \right)^{-1} \qquad (23)$$

where $e_{N,1}(z), \ldots, e_{N,n}(z)$ form the unique solution of

$$e_{N,j}(z) = \frac{1}{n}\operatorname{tr} R_j \left( \frac{1}{n}\sum_{i=1}^{n} \frac{R_i}{1 + e_{N,i}(z)} + A_N - z I_N \right)^{-1}$$

such that all $e_{N,j}(z)$ are Stieltjes transforms of a non-negative finite measure on $\mathbb{R}_+$.

From Theorem 2, the distribution function $F_N$ with Stieltjes transform $m_N(z)$ is a deterministic equivalent for the eigenvalue distribution of $\mathbf{F}_N$. We next describe the successive moments of the distribution function $F_N$. This generalizes the asymptotic moment results in [29], valid only for $A_N = 0$.


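Theorem 3. Let $F_N$ be the distribution function associated with the Stieltjes transform (23), and denote $M_{N,0}, M_{N,1}, \ldots$ the successive moments of $F_N$, i.e., $M_{N,p} = \int x^p \, dF_N(x)$. Then,

$$M_{N,p} = \frac{(-1)^p}{p!}\,\frac{1}{N}\operatorname{tr} T_p$$

with $T_0, T_1, \ldots$ defined recursively from

$$T_{p+1} = \sum_{i=0}^{p}\sum_{j=0}^{i}\binom{p}{i}\binom{i}{j}\, T_{p-i}\, Q_{i-j+1}\, T_j$$

$$Q_{p+1} = \frac{p+1}{n}\sum_{k=1}^{n} f_{k,p}\, R_k - 1_{\{p=0\}}\, A_N$$

$$f_{k,p+1} = \sum_{i=0}^{p}\sum_{j=0}^{i}\binom{p}{i}\binom{i}{j}\,(p-i+1)\, f_{k,j}\, f_{k,i-j}\, \beta_{k,p-i}$$

$$\beta_{k,p+1} = \frac{1}{n}\operatorname{tr}\left[ R_k T_{p+1} \right]$$

and $T_0 = I_N$, $f_{k,0} = -1$, $\beta_{k,0} = \frac{1}{n}\operatorname{tr} R_k$ for $k \in \{1, \ldots, n\}$.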


Proof: Follows the same steps as the proof of [29, Thm. 2] with proper modifications to account for $A_N \neq 0$. ■